
Abstract 

We consider high-dimensional quadratic classifiers in non-sparse settings. The 
target of classification rules is not Bayes error rates in the context. The classifier 
based on the Mahalanobis distance does not always give a preferable performance 
even if the populations are normal distributions having known covariance matrices. 

The quadratic classifiers proposed in this paper draw information about heterogeneity 
effectively through both the differences of expanding mean vectors and covariance 
matrices. We show that they hold a consistency property in which misclassification 
rates tend to zero as the dimension goes to infinity under non-sparse settings. We 
verify that they are asymptotically distributed as a normal distribution under certain 
conditions. We also propose a quadratic classifier after feature selection by using 
both the differences of mean vectors and covariance matrices. Finally, we discuss 
performances of the classifiers in actual data analyses. The proposed classifiers achieve 
highly accurate classification with very low computational costs. 

Keywords: Bayes error rate; Discriminant analysis; Feature selection; Heterogeneity; Large p small n

1 Introduction 

Globally, there is an ever increasing need for fast, accurate and cost effective analysis of high-dimensional data in many fields, including academia, medicine and business. However, existing classifiers for high-dimensional data are often complex, time consuming and have no guarantee of accuracy. In this paper we hope to provide better options. A common feature of high-dimensional data is that the data dimension is high while the sample size is relatively low. This is the so-called “HDLSS” or “large $p$, small $n$” data situation where $p/n \to \infty$; here $p$ is the data dimension and $n$ is the sample size. Suppose we have two independent $p$-variate populations, $\pi_i$, $i = 1, 2$, having an unknown mean vector $\mu_i = (\mu_{i1}, \ldots, \mu_{ip})^T$ and unknown covariance matrix $\Sigma_i\ (> O)$ for each $i$. Let

$$\mu_{12} = \mu_1 - \mu_2 = (\mu_{121}, \ldots, \mu_{12p})^T \quad \text{and} \quad \Sigma_{12} = \Sigma_1 - \Sigma_2.$$

We assume that $\limsup_{p \to \infty} |\mu_{12j}| < \infty$ for all $j$. Note that $\limsup_{p \to \infty} \|\mu_{12}\|^2/p < \infty$, where $\|\cdot\|$ denotes the Euclidean norm. Let $\sigma_{i(j)}$ be the $j$-th diagonal element of $\Sigma_i$ for $j = 1, \ldots, p$ $(i = 1, 2)$. We assume that $\sigma_{i(j)} \in (0, \infty)$ as $p \to \infty$ for all $i, j$. Here, for a function, $f(\cdot)$, “$f(p) \in (0, \infty)$ as $p \to \infty$” implies that $\liminf_{p \to \infty} f(p) > 0$ and $\limsup_{p \to \infty} f(p) < \infty$. Then, it holds that $\mathrm{tr}(\Sigma_i)/p \in (0, \infty)$ as $p \to \infty$ for $i = 1, 2$. We do not assume $\Sigma_1 = \Sigma_2$. The eigen-decomposition of $\Sigma_i$ is given by $\Sigma_i = H_i \Lambda_i H_i^T$, where $\Lambda_i = \mathrm{diag}(\lambda_{i1}, \ldots, \lambda_{ip})$ is a diagonal matrix of eigenvalues, $\lambda_{i1} \ge \cdots \ge \lambda_{ip} > 0$, and $H_i = [h_{i1}, \ldots, h_{ip}]$ is an orthogonal matrix of the corresponding eigenvectors. We have independent and identically distributed (i.i.d.) observations, $x_{i1}, \ldots, x_{in_i}$, from each $\pi_i$, where $x_{ik} = (x_{i1k}, \ldots, x_{ipk})^T$, $k = 1, \ldots, n_i$. We assume $n_i \ge 2$, $i = 1, 2$. Let $n_{\min} = \min\{n_1, n_2\}$. We estimate $\mu_i$ and $\Sigma_i$ by $\bar{x}_{in_i} = (\bar{x}_{i1n_i}, \ldots, \bar{x}_{ipn_i})^T = \sum_{k=1}^{n_i} x_{ik}/n_i$ and $S_{in_i} = \sum_{k=1}^{n_i} (x_{ik} - \bar{x}_{in_i})(x_{ik} - \bar{x}_{in_i})^T/(n_i - 1)$. Let $s_{in_i(j)}$ be the $j$-th diagonal element of $S_{in_i}$ for $j = 1, \ldots, p$ $(i = 1, 2)$.

In this paper, we consider high-dimensional quadratic classifiers in non-sparse settings. Let $x_0 = (x_{01}, \ldots, x_{0p})^T$ be an observation vector of an individual belonging to one of the two populations. Let $|M|$ be the determinant of a square matrix $M$. When $\pi_i$s are Gaussian, a Bayes optimal rule is given as follows: One classifies the individual into $\pi_1$ if

$$(x_0 - \mu_1)^T \Sigma_1^{-1} (x_0 - \mu_1) - \log|\Sigma_2 \Sigma_1^{-1}| < (x_0 - \mu_2)^T \Sigma_2^{-1} (x_0 - \mu_2) \tag{1.1}$$

and into $\pi_2$ otherwise. Since $\mu_i$s and $\Sigma_i$s are unknown, one usually considers the following typical classifier:

$$(x_0 - \bar{x}_{1n_1})^T S_{1n_1}^{-1} (x_0 - \bar{x}_{1n_1}) - \log|S_{2n_2} S_{1n_1}^{-1}| < (x_0 - \bar{x}_{2n_2})^T S_{2n_2}^{-1} (x_0 - \bar{x}_{2n_2}).$$

The classifier usually converges to the Bayes optimal classifier when $n_{\min} \to \infty$ while $p$ is fixed or $n_{\min}/p \to \infty$. However, in the HDLSS context, the inverse matrix of $S_{in_i}$ does not exist. When $\Sigma_1 = \Sigma_2$, Bickel and Levina (2004) considered an inverse matrix defined by only diagonal elements of the pooled sample covariance matrix. Fan and Fan (2008) considered a classification after feature selection. Fan, Feng and Tong (2012) proposed the regularized optimal affine discriminant (ROAD). When $\Sigma_1 \neq \Sigma_2$, Dudoit, Fridlyand and Speed (2002) considered an inverse matrix defined by only diagonal elements of $S_{in_i}$. Aoshima and Yata (2011) considered using $\{\mathrm{tr}(S_{in_i})/p\} I_p$ instead of $S_{in_i}$ from a geometrical background and proposed the geometric classifier. Here, $I_p$ denotes the identity matrix of dimension $p$. Hall, Marron and Neeman (2005) and Marron, Todd and Ahn (2007) considered distance weighted classifiers. Chan and Hall (2009) and Aoshima and Yata (2014) considered distance-based classifiers and Aoshima and Yata (2014) gave the misclassification rate adjusted classifier for multiclass, high-dimensional data whose misclassification rates are no more than specified thresholds.
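The HDLSS obstacle mentioned above, namely that the inverse of $S_{in_i}$ does not exist, can be checked numerically. The following is a minimal sketch, assuming NumPy; the function name `sample_mean_cov` and the toy sizes are illustrative choices and do not come from the paper.

```python
import numpy as np

def sample_mean_cov(x):
    """Return the sample mean and the unbiased sample covariance of an (n, p) array."""
    n = x.shape[0]
    xbar = x.mean(axis=0)
    centered = x - xbar
    s = centered.T @ centered / (n - 1)
    return xbar, s

rng = np.random.default_rng(0)
n, p = 5, 20                      # toy HDLSS sizes with p > n (assumed for illustration)
x = rng.standard_normal((n, p))
xbar, s = sample_mean_cov(x)

# The centered data matrix has rank at most n - 1, so S is singular when p >= n.
rank = np.linalg.matrix_rank(s)
print(rank, p)
```

Because the rank of the $p \times p$ matrix is bounded by $n - 1$, any classifier that needs $S_{in_i}^{-1}$ has to replace it by something else, which is what the approaches cited above do.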

Recently, Cai and Liu (2011), Shao et al. (2011) and Li and Shao (2015) gave sparse linear or quadratic classification rules for high-dimensional data. They showed that their classification rules have Bayes error rates when $\pi_i$s are Gaussian. They assumed that $\lambda_{ij}$s are bounded under some sparsity conditions such as $\mu_{12}$, $\Sigma_i$s and $\Sigma_{12}$ (or $\Sigma_i^{-1}$s and $\Sigma_1^{-1} - \Sigma_2^{-1}$) are sparse. For example, when $\Sigma_1 = \Sigma_2$ ($= \Sigma$, say), the error rate of their classification rules is given by $\Phi(-\Delta_{MD}^{1/2}/2) + o(1)$ as $p \to \infty$, where $\Delta_{MD} = \mu_{12}^T \Sigma^{-1} \mu_{12}$ that is the Mahalanobis distance and $\Phi(\cdot)$ denotes the cumulative distribution function of the standard normal distribution. Here, $\Phi(-\Delta_{MD}^{1/2}/2)$ is the Bayes error rate.

In this paper, we investigate quadratic classifiers from a perspective that is different from the sparse discriminant analysis. We do not assume that $\mu_{12}$, $\Sigma_i$s and $\Sigma_{12}$ are sparse. In such a context, the target of classification rules is not Bayes error rates as in $\Phi(-\Delta_{MD}^{1/2}/2) + o(1)$ as $p \to \infty$. We consider a consistency property such as misclassification rates tend to 0 as $p$ increases, i.e.,

$$e(i) \to 0 \ \text{ as } p \to \infty \ \text{ for } i = 1, 2,$$

where $e(i)$ denotes the error rate of misclassifying an individual from $\pi_i$ into the other class. For example, if one can assume that $\pi_i$s are Gaussian and $\Sigma_1 = \Sigma_2$, the Bayes rule by (1.1) has such a consistency property when $\Delta_{MD} \to \infty$ as $p \to \infty$. It is likely that $\Delta_{MD} \to \infty$ as $p \to \infty$ when $\mu_{12}$ is non-sparse in the sense that $\|\mu_{12}\| \to \infty$ as $p \to \infty$. We emphasize that such non-sparse situations often occur in high-dimensional settings. For example, see Hall, Marron and Neeman (2005) or (6.1), (6.2) and Table 2 in Section 6. We will show that quadratic classifiers hold the consistency property when $\mu_{12}$ or $\Sigma_{12}$ is non-sparse such as $\|\mu_{12}\| \to \infty$ or $\|\Sigma_{12}\|_F \to \infty$ as $p \to \infty$, where $\|\cdot\|_F$ is the Frobenius norm.

In this paper, we consider the following function of $A_i$ to discriminate $\pi_i$s in general:

$$W_i(A_i) = (x_0 - \bar{x}_{in_i})^T A_i (x_0 - \bar{x}_{in_i}) - \mathrm{tr}(S_{in_i} A_i)/n_i - \log|A_i|, \tag{1.2}$$

where $A_i$ is a positive definite matrix satisfying the equation that $\mathrm{tr}\{\Sigma_i(A_{i'} - A_i)\} = \mathrm{tr}(A_i^{-1} A_{i'}) - p$ $(i \neq i')$. Here, $\mathrm{tr}(S_{in_i} A_i)/n_i$ is a bias correction term. We consider a quadratic classification rule in which one classifies the individual into $\pi_1$ if

$$W_1(A_1) - W_2(A_2) < 0 \tag{1.3}$$

and into $\pi_2$ otherwise. Note that (1.3) becomes a linear classifier when $A_1 = A_2$. We have that $E\{W_{i'}(A_{i'})\} - E\{W_i(A_i)\} = \Delta_i$ when $x_0 \in \pi_i$, where

$$\Delta_i = \mu_{12}^T A_{i'} \mu_{12} + \mathrm{tr}\{\Sigma_i(A_{i'} - A_i)\} + \log|A_{i'}^{-1} A_i| \tag{1.4}$$

for $i = 1, 2$ $(i' \neq i)$.

Proposition 1.1. (i) $\Delta_i \ge 0$. (ii) $\Delta_i > 0$ when $\mu_1 \neq \mu_2$ or $A_1 \neq A_2$.
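The quantities in (1.2) to (1.4) translate directly into a few lines of linear algebra. The sketch below assumes NumPy and known matrices $A_i$; the function names `w_stat`, `classify` and `delta` are hypothetical helpers introduced here for illustration only.

```python
import numpy as np

def w_stat(x0, xbar, s, a, n):
    """W_i(A_i) in (1.2): quadratic form, bias correction, and log-determinant term."""
    d = x0 - xbar
    # A_i is assumed positive definite, so only the log-magnitude is needed.
    logdet = np.linalg.slogdet(a)[1]
    return d @ a @ d - np.trace(s @ a) / n - logdet

def classify(x0, stats1, stats2, a1, a2):
    """Rule (1.3): return 1 if W_1(A_1) - W_2(A_2) < 0, and 2 otherwise.

    stats_i is the tuple (sample mean, sample covariance, sample size) of population i.
    """
    xbar1, s1, n1 = stats1
    xbar2, s2, n2 = stats2
    w1 = w_stat(x0, xbar1, s1, a1, n1)
    w2 = w_stat(x0, xbar2, s2, a2, n2)
    return 1 if w1 - w2 < 0 else 2

def delta(mu12, sigma_i, a_i, a_other):
    """Delta_i in (1.4), where a_other plays the role of A_{i'}."""
    logdet = np.linalg.slogdet(np.linalg.solve(a_other, a_i))[1]  # log|A_{i'}^{-1} A_i|
    return mu12 @ a_other @ mu12 + np.trace(sigma_i @ (a_other - a_i)) + logdet
```

For instance, with $A_1 = A_2 = I_p$ the function `delta` returns $\|\mu_{12}\|^2$, and with $A_i = \Sigma_i^{-1}$ it returns a nonnegative value, in line with Proposition 1.1.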

Remark 1. As for $g$ $(\ge 3)$-class classification, one may consider a classification rule such as one classifies the individual into $\pi_i$ if

$$\operatorname*{argmin}_{i' = 1, \ldots, g} W_{i'}(A_{i'}) = i.$$

In this paper, we specially consider the following four typical $A_i$s in (1.2):

(I) $A_i = I_p$, (II) $A_i = \{p/\mathrm{tr}(\Sigma_i)\} I_p$, (III) $A_i = \Sigma_{i(d)}^{-1}$, and (IV) $A_i = \Sigma_i^{-1}$,

where $\Sigma_{i(d)} = \mathrm{diag}(\sigma_{i(1)}, \ldots, \sigma_{i(p)})$. These four $A_i$s satisfy the condition that $\mathrm{tr}\{\Sigma_i(A_{i'} - A_i)\} = \mathrm{tr}(A_i^{-1} A_{i'}) - p$ $(i \neq i')$ and they provide historical background of discriminant analysis. Note that $\|\Sigma_{12}\|_F \ge \|A_1^{-1} - A_2^{-1}\|_F$ for these four $A_i$s. Also, under (I) to (IV), we note that $\Delta_i \to \infty$ as $p \to \infty$ when $\mu_{12}$ or $\Sigma_{12}$ is non-sparse. Practically, $A_i$s should be estimated except for (I). We will consider quadratic classifiers given by estimating $A_i$s in Section 4. Now, let us see an easy example to check the performance of (I) to (IV) in (1.3). We set $p = 2^s$, $s = 3, \ldots, 12$. Independent pseudo random observations were generated from $\pi_i : N_p(\mu_i, \Sigma_i)$, $i = 1, 2$. We set $\mu_1 = 0$ and $\Sigma_1 = B_1 (0.3^{|i-j|^{1/3}}) B_1$, where $B_1 = \mathrm{diag}[\{0.5 + 1/(p+1)\}^{1/2}, \ldots, \{0.5 + p/(p+1)\}^{1/2}]$. Note that $\mathrm{tr}(\Sigma_1) = p$ and $\Sigma_{1(d)} = B_1^2$. When $\Sigma_1 = \Sigma_2$ and $(n_1, n_2) = (\log_2 p, 2\log_2 p)$, we considered two cases:

(a) $\mu_2 = (1, \ldots, 1, 0, \ldots, 0)^T$ whose first $\lceil p^{2/3} \rceil$ elements are 1, and

(b) $\mu_2 = (0, \ldots, 0, 1, \ldots, 1)^T$ whose last $\lceil p^{2/3} \rceil$ elements are 1.

Here, $\lceil x \rceil$ denotes the smallest integer $\ge x$. Next, when $\mu_2 = 0$ (i.e., $\mu_{12} = 0$) and $(n_1, n_2) = (5, 10)$, we considered two cases:

(c) $\Sigma_2 = 1.5\Sigma_1$ and (d) $\Sigma_2 = 1.2 I_p$.

Note that $\mu_{12}$ or $\Sigma_{12}$ is non-sparse for (a) to (d) because $\|\mu_{12}\| \to \infty$ or $\|\Sigma_{12}\|_F \to \infty$ as $p \to \infty$. For $x_0 \in \pi_i$ $(i = 1, 2)$ we repeated 2000 times to confirm if the classification rule by (1.3) with either of (I) to (IV) does (or does not) classify $x_0$ correctly and defined $P_{ir} = 0$ (or 1) accordingly for each $\pi_i$. We calculated the error rates, $\bar{e}(i) = \sum_{r=1}^{2000} P_{ir}/2000$, $i = 1, 2$. Also, we calculated the average error rate, $\bar{e} = \{\bar{e}(1) + \bar{e}(2)\}/2$. Their standard deviations are less than 0.011. In Figure 1, we plotted $\bar{e}$ for (a) and (b). Note that (I) is equivalent to (II) for (a) and (b). In Figure 2, we plotted $\bar{e}$ for (c) and (d). We observed that (IV) gives the worst performance in Figure 1 contrary to expectations. In general, one would think that the classifier based on the Mahalanobis distance such as (1.2) with (IV) is the best when $\pi_i$s are Gaussian and $n_{\min} \to \infty$. We emphasize that it is not true for high-dimensional data. We will explain its theoretical reason in Section 3.2. We observed that (I) (or (II)) gives a better performance compared to (III) for (b) in Figure 1. We will discuss the reasons in Section 3.4. In Figure 2, the error rates of (I) are close to 0.5 because of $\mu_{12} = 0$. On the other hand, (II), (III) and (IV) gave good performances as $p$ increases by drawing information on heteroscedasticity in the classifiers. We will give their theoretical backgrounds in Sections 2.2 and 3.4.

Figure 1: The average error rates of the classification rule by (1.3) for (I) to (IV) when $\Sigma_1 = \Sigma_2$. The left and right panels display $\bar{e}$ in the cases of (a) and (b), respectively.

Figure 2: The average error rates of the classification rule by (1.3) for (I) to (IV) when $\mu_1 = \mu_2$. The left and right panels display $\bar{e}$ in the cases of (c) $\Sigma_2 = 1.5\Sigma_1$ and (d) $\Sigma_2 = 1.2 I_p$, respectively.

We pay special attention to the difference of covariance matrices in classification for high-dimensional data. In Section 2, we show that the classification rule by (1.3) holds the consistency property under non-sparse settings. In Section 3, we verify that the quadratic classifier by (1.2) is asymptotically distributed as a normal distribution under certain conditions. In Section 4, we consider the estimation of $A_i$s and give asymptotic properties of estimated classifiers. In Section 5, we propose a quadratic classifier after feature selection by using both the differences of mean vectors and covariance matrices. In Section 6, we discuss performances of the classifiers in actual data analyses. Finally, in Section 7, we give concluding remarks of our study.


2 Consistency property of the quadratic classifier 

In this section, we discuss the consistency property of quadratic classifiers given by (1.2).


2.1 Preliminary 


Similar to Bai and Saranadasa (1996) and Aoshima and Yata (2011), we assume the following assumption about population distributions as necessary:

(A-i) Let $y_{ik}$, $k = 1, \ldots, n_i$, be i.i.d. random $q_i$-vectors having $E(y_{ik}) = 0$ and $\mathrm{Var}(y_{ik}) = I_{q_i}$ for each $i$ $(= 1, 2)$, where $q_i \ge p$. Let $y_{ik} = (y_{i1k}, \ldots, y_{iq_ik})^T$ whose components satisfy that $\limsup_{p \to \infty} E(y_{ijk}^4) < \infty$ for all $j$ and

$$E(y_{ijk}^2 y_{irk}^2) = E(y_{ijk}^2) E(y_{irk}^2) = 1 \quad \text{and} \quad E(y_{ijk} y_{irk} y_{isk} y_{itk}) = 0 \tag{2.1}$$

for all $j \neq r, s, t$. Then, the observations, $x_{ik}$s, from each $\pi_i$ $(i = 1, 2)$ are given by

$$x_{ik} = \Gamma_i y_{ik} + \mu_i, \quad k = 1, \ldots, n_i,$$

where $\Gamma_i = [\gamma_{i1}, \ldots, \gamma_{iq_i}]$ is a $p \times q_i$ matrix such that $\Gamma_i \Gamma_i^T = \Sigma_i$.

Note that $\Gamma_i$ includes the case that $\Gamma_i = H_i \Lambda_i^{1/2} = [\lambda_{i1}^{1/2} h_{i1}, \ldots, \lambda_{ip}^{1/2} h_{ip}]$. We assume the following assumption instead of (A-i) as necessary:

(A-ii) (A-i) by replacing (2.1) with the independence of $y_{ijk}$, $j = 1, \ldots, q_i$ $(i = 1, 2;\ k = 1, \ldots, n_i)$.

Note that (A-ii) is a special case of (A-i). When $\pi_i$ has $N_p(\mu_i, \Sigma_i)$, (A-ii) naturally holds. Now, we consider the following divergence condition for $p$ and $n_i$s:

($\star$) $p \to \infty$ either when $n_i$ is fixed or $n_i \to \infty$ for $i = 1, 2$.

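Let $\Delta_{iA} = \mu_{12}^T A_{i'} \Sigma_i A_{i'} \mu_{12}$ for $i = 1, 2$ $(i' \neq i)$. We consider the following conditions under ($\star$) for $i = 1, 2$ $(i' \neq i)$:

$$\text{(C-i)} \quad \frac{\mathrm{tr}\{(\Sigma_i A_i)^2\}}{n_i \Delta_i^2} = o(1) \ \text{ and } \ \frac{\mathrm{tr}(\Sigma_i A_{i'} \Sigma_{i'} A_{i'}) + \mathrm{tr}\{(\Sigma_{i'} A_{i'})^2\}/n_{i'}}{n_{i'} \Delta_i^2} = o(1);$$

$$\text{(C-ii)} \quad \frac{\Delta_{iA}}{\Delta_i^2} = o(1); \ \text{ and } \ \text{(C-iii)} \quad \frac{\mathrm{tr}[\{\Sigma_i(A_1 - A_2)\}^2]}{\Delta_i^2} = o(1).$$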


Then, we claim the consistency property of (1.2) in (1.3) as follows:

Theorem 2.1. Assume (A-i). Assume also (C-i) to (C-iii). Then, we have that

$$\frac{W_{i'}(A_{i'}) - W_i(A_i)}{\Delta_i} = 1 + o_P(1) \ \text{ under } (\star) \text{ when } x_0 \in \pi_i \text{ for } i = 1, 2\ (i' \neq i).$$

Furthermore, for the classification rule by (1.3) with (1.2), we have that

$$e(i) \to 0, \ i = 1, 2, \ \text{ under } (\star). \tag{2.2}$$

Remark 2. When $A_1 = A_2$, we can claim Theorem 2.1 without (A-i) and (C-iii).

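Let $\lambda_{\min}(M)$ and $\lambda_{\max}(M)$ be the smallest and the largest eigenvalues of any positive definite matrix, $M$. We use the phrase “$\lambda(M) \in (0, \infty)$ as $p \to \infty$” in the sense that $\liminf_{p \to \infty} \lambda_{\min}(M) > 0$ and $\limsup_{p \to \infty} \lambda_{\max}(M) < \infty$. We note that $A_i$s in (I) to (III) satisfy the condition “$\lambda(A_i) \in (0, \infty)$ as $p \to \infty$”. Let $\Delta_{\min} = \min\{\Delta_1, \Delta_2\}$, $\lambda_{\max} = \max\{\lambda_{\max}(\Sigma_1), \lambda_{\max}(\Sigma_2)\}$ and $\mathrm{tr}(\Sigma_{\max}^2) = \max\{\mathrm{tr}(\Sigma_1^2), \mathrm{tr}(\Sigma_2^2)\}$. Now, instead of (C-i) and (C-ii), we consider the following simpler conditions under ($\star$):

$$\text{(C-i')} \quad \frac{\mathrm{tr}(\Sigma_{\max}^2)}{n_{\min} \Delta_{\min}^2} = o(1) \ \text{ and } \ \text{(C-ii')} \quad \frac{\lambda_{\max} \|\mu_{12}\|^2}{\Delta_{\min}^2} = o(1).$$

Proposition 2.1. Assume that $\limsup_{p \to \infty} \lambda_{\max}(A_i) < \infty$ for $i = 1, 2$. Then, (C-i') and (C-ii') imply (C-i) and (C-ii), respectively. Furthermore, if $\lambda(A_i) \in (0, \infty)$ as $p \to \infty$ for $i = 1, 2$, and $A_i$, $i = 1, 2$, are diagonal matrices such as in (I) to (III) in Section 1, (C-ii') implies (C-iii).

From the fact that $\lambda_{\max}(\Sigma_i) \le \mathrm{tr}(\Sigma_i^2)^{1/2}$ for $i = 1, 2$, we note that (C-i') and (C-ii') hold even when $n_{\min}$ is fixed under

$$\mathrm{tr}(\Sigma_{\max}^2)/\Delta_{\min}^2 \to 0 \ \text{ as } p \to \infty. \tag{2.3}$$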


2.2 Consistency property for (I) to (IV) 

As mentioned in Section 1, four typical $A_i$s were specifically selected. For (I), by putting $A_i = I_p$, $i = 1, 2$, (1.2) and (1.4) are given as

$$W_i(I_p) = \|x_0 - \bar{x}_{in_i}\|^2 - \mathrm{tr}(S_{in_i})/n_i \tag{2.4}$$

and $\Delta_1 = \Delta_2 = \|\mu_{12}\|^2$ (hereafter called $\Delta_{(I)}$).
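For (II), by putting $A_i = \{p/\mathrm{tr}(\Sigma_i)\} I_p$, $i = 1, 2$, they are given as

$$W_i(\{p/\mathrm{tr}(\Sigma_i)\} I_p) = \frac{p \|x_0 - \bar{x}_{in_i}\|^2}{\mathrm{tr}(\Sigma_i)} - \frac{p\, \mathrm{tr}(S_{in_i})}{n_i \mathrm{tr}(\Sigma_i)} + p \log\{\mathrm{tr}(\Sigma_i)/p\} \tag{2.5}$$

and

$$\Delta_i = \frac{p \|\mu_{12}\|^2}{\mathrm{tr}(\Sigma_{i'})} + \frac{p\, \mathrm{tr}(\Sigma_i)}{\mathrm{tr}(\Sigma_{i'})} - p + p \log\left\{\frac{\mathrm{tr}(\Sigma_{i'})}{\mathrm{tr}(\Sigma_i)}\right\} \quad \text{(hereafter called } \Delta_{i(II)}\text{)}.$$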

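For (III), by putting $A_i = \Sigma_{i(d)}^{-1}$, $i = 1, 2$, they are given as

$$W_i(\Sigma_{i(d)}^{-1}) = \sum_{j=1}^{p} \left\{ \frac{(x_{0j} - \bar{x}_{ijn_i})^2}{\sigma_{i(j)}} - \frac{s_{in_i(j)}}{n_i \sigma_{i(j)}} + \log \sigma_{i(j)} \right\} \tag{2.6}$$

and

$$\Delta_i = \sum_{j=1}^{p} \left\{ \frac{\mu_{12j}^2 + \sigma_{i(j)}}{\sigma_{i'(j)}} - 1 + \log\left(\frac{\sigma_{i'(j)}}{\sigma_{i(j)}}\right) \right\} \quad \text{(hereafter called } \Delta_{i(III)}\text{)}.$$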
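For (IV), by putting $A_i = \Sigma_i^{-1}$, $i = 1, 2$, they are given as

$$W_i(\Sigma_i^{-1}) = (x_0 - \bar{x}_{in_i})^T \Sigma_i^{-1} (x_0 - \bar{x}_{in_i}) - \frac{\mathrm{tr}(S_{in_i} \Sigma_i^{-1})}{n_i} + \sum_{j=1}^{p} \log \lambda_{ij} \tag{2.7}$$

and

$$\Delta_i = \mu_{12}^T \Sigma_{i'}^{-1} \mu_{12} + \mathrm{tr}(\Sigma_i \Sigma_{i'}^{-1}) - p + \sum_{j=1}^{p} \log\left(\frac{\lambda_{i'j}}{\lambda_{ij}}\right) \quad \text{(hereafter called } \Delta_{i(IV)}\text{)}.$$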


We first consider the classifiers by (2.4) to (2.6). From Theorem 2.1 and Proposition 2.1, we have the following result.

Corollary 2.1. Assume (C-i') and (C-ii'). Then, for the classification rule by (1.3) with (2.4), we have (2.2). Furthermore, for the classification rule by (1.3) with (2.5) or (2.6), we have (2.2) under (A-i).

We note that the classifier by (2.4) is equivalent to the distance-based classifier by Aoshima and Yata (2014). Hereafter, we call the classifier by (2.4) the “distance-based discriminant analysis (DBDA)”. From Corollary 2.1, under (2.3), the classification rule by (1.3) with (2.4), (2.5) or (2.6) has (2.2) even when $n_i$s are fixed. Note that DBDA has the consistency property without (A-i), so that DBDA is quite robust for non-Gaussian cases. See Aoshima and Yata (2014) for details. When $\mu_1 = \mu_2$, DBDA does not satisfy (C-i') and (C-ii'); on the other hand, the classifier by (2.5) or (2.6) still satisfies them.
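Since DBDA in (2.4) involves no unknown population parameters, it can be written down directly. The following is a minimal sketch, assuming NumPy; the helper names `dbda_stat` and `dbda_classify` are ours and not taken from the paper.

```python
import numpy as np

def dbda_stat(x0, x):
    """W_i(I_p) in (2.4), computed from an (n_i, p) training sample x of population i."""
    n = x.shape[0]
    xbar = x.mean(axis=0)
    # tr(S_{i n_i}) is the sum of the coordinate-wise sample variances,
    # so the p x p matrix S never has to be formed.
    tr_s = np.sum((x - xbar) ** 2) / (n - 1)
    return np.sum((x0 - xbar) ** 2) - tr_s / n

def dbda_classify(x0, x1, x2):
    """Rule (1.3) with (2.4): return 1 if W_1(I_p) - W_2(I_p) < 0, and 2 otherwise."""
    return 1 if dbda_stat(x0, x1) - dbda_stat(x0, x2) < 0 else 2
```

The cost is linear in $p$ for each population, which is one concrete sense in which the classifier is computationally cheap.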

Now, we consider the following condition for $i = 1, 2$:

$$\mathrm{tr}(\Sigma_i^2)/\mathrm{tr}(\Sigma_i)^2 \to 0 \ \text{ as } p \to \infty. \tag{2.8}$$

We note that $\mathrm{tr}(\Sigma_i^2)/\mathrm{tr}(\Sigma_i)^2$ is a measure of sphericity. Also, note that (2.8) is equivalent to the condition that “$\lambda_{\max}(\Sigma_i)/\mathrm{tr}(\Sigma_i) \to 0$ as $p \to \infty$”. Under (A-i) and (2.8), from the fact that $\mathrm{Var}(\|x_0 - \mu_i\|^2) = O\{\mathrm{tr}(\Sigma_i^2)\}$ when $x_0 \in \pi_i$, we have that as $p \to \infty$

$$\|x_0 - \mu_i\| = \mathrm{tr}(\Sigma_i)^{1/2} \{1 + o_P(1)\} \ \text{ when } x_0 \in \pi_i.$$

Thus the centroid data lies near the surface of an expanding sphere. See Hall, Marron and Neeman (2005) for details of the geometric representation. We emphasize that the classifier by (2.5) draws information about heteroscedasticity through the geometric representation having different radii, $\mathrm{tr}(\Sigma_i)^{1/2}$s, of expanding two spheres. Note that $\mathrm{tr}(\Sigma_i^2) = o(p^2)$ under (2.8). Hence, for the classifier by (2.5), (2.3) holds under (2.8) and $\liminf_{p \to \infty} \Delta_{\min(II)}/p > 0$, where $\Delta_{\min(II)} = \min\{\Delta_{1(II)}, \Delta_{2(II)}\}$. Note that $\Delta_{\min(II)} > 0$ when $\mathrm{tr}(\Sigma_1) \neq \mathrm{tr}(\Sigma_2)$ in view of Proposition 1.1. If one can assume that $\liminf_{p \to \infty} |\mathrm{tr}(\Sigma_1)/\mathrm{tr}(\Sigma_2) - 1| > 0$, it follows $\liminf_{p \to \infty} \Delta_{\min(II)}/p > 0$, so that (2.3) holds under (2.8). Hence, for the classification rule by (1.3) with (2.5), we have (2.2) even when $\mu_1 = \mu_2$ and $n_i$s are fixed. See (II) in Figure 2. The accuracy becomes higher as the difference between $\mathrm{tr}(\Sigma_i)$s grows.
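The geometric representation above is easy to observe numerically. The sketch below assumes NumPy and a Gaussian population with a diagonal covariance matrix chosen by us for convenience; `sphericity` is a hypothetical helper name.

```python
import numpy as np

def sphericity(sigma):
    """tr(Sigma^2) / tr(Sigma)^2, the quantity required to vanish in (2.8)."""
    return np.trace(sigma @ sigma) / np.trace(sigma) ** 2

rng = np.random.default_rng(1)
p = 400
diag = 0.5 + np.arange(1, p + 1) / (p + 1)         # diagonal Sigma with tr(Sigma) = p
x0 = rng.standard_normal(p) * np.sqrt(diag)         # one draw from N_p(0, Sigma)

# Under (2.8) the norm of the centered observation concentrates around tr(Sigma)^{1/2}.
ratio = np.linalg.norm(x0) / np.sqrt(diag.sum())
print(sphericity(np.diag(diag)), ratio)
```

For this choice the sphericity measure is of order $1/p$, and the printed ratio is close to 1; two populations with different traces would therefore concentrate near spheres of different radii.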

Similarly, for the classifier by (2.6), it follows that (2.3) holds under (2.8) and $\liminf_{p \to \infty} \Delta_{\min(III)}/p > 0$, where $\Delta_{\min(III)} = \min\{\Delta_{1(III)}, \Delta_{2(III)}\}$. If one can assume that $\liminf_{p \to \infty} \sum_{j=1}^{p} |\sigma_{1(j)}/\sigma_{2(j)} - 1|/p > 0$, it follows $\liminf_{p \to \infty} \Delta_{\min(III)}/p > 0$, so that the classification rule by (1.3) with (2.6) has (2.2) even when $\mu_1 = \mu_2$ and $n_i$s are fixed. The classifier by (2.6) draws information about heteroscedasticity via the difference of diagonal elements between the two covariance matrices. The accuracy becomes higher as the difference of those diagonal elements grows. See (III) in Figure 2.

Next, we consider the classifier by (2.7). From Theorem 2.1 and Proposition 2.1, we have the following result.

Corollary 2.2. Assume (A-i). Assume also $\liminf_{p \to \infty} \lambda_{\min}(\Sigma_i) > 0$ for $i = 1, 2$. Then, for the classification rule by (1.3) with (2.7), we have (2.2) under (C-i'), (C-ii') and the condition that $\mathrm{tr}\{(I_p - \Sigma_i \Sigma_{i'}^{-1})^2\} = o(\Delta_{\min(IV)}^2)$ for $i = 1, 2$ $(i' \neq i)$, where $\Delta_{\min(IV)} = \min\{\Delta_{1(IV)}, \Delta_{2(IV)}\}$.

When $\Sigma_1 \neq \Sigma_2$, note that $\Delta_{\min(IV)} > 0$ in view of Proposition 1.1. Then, we have the following result.

Proposition 2.2. When $\liminf_{p \to \infty} |\mathrm{tr}(\Sigma_i \Sigma_{i'}^{-1})/p - 1| > 0$ or $\liminf_{p \to \infty} \sum_{j=1}^{p} |\lambda_{ij}/\lambda_{i'j} - 1|/p > 0$ $(i \neq i')$, it follows that $\liminf_{p \to \infty} \Delta_{i(IV)}/p > 0$.

Note that $\mathrm{tr}\{(I_p - \Sigma_i \Sigma_{i'}^{-1})^2\} \le p + \mathrm{tr}\{(\Sigma_i \Sigma_{i'}^{-1})^2\} = p + O\{\mathrm{tr}(\Sigma_i^2)\} = o(p^2)$ under (2.8) and $\liminf_{p \to \infty} \lambda_{\min}(\Sigma_{i'}) > 0$. Hence, from Corollary 2.2, for the classification rule by (1.3) with (2.7), we have (2.2) under (A-i), (2.8), $\liminf_{p \to \infty} \Delta_{\min(IV)}/p > 0$ and $\liminf_{p \to \infty} \lambda_{\min}(\Sigma_i) > 0$ for $i = 1, 2$. Thus from Proposition 2.2, the accuracy becomes higher as the difference of eigenvalues or eigenvectors between the two covariance matrices grows. See (IV) in Figure 2.


3 Asymptotic normality of the quadratic classifier 

In this section, we discuss the asymptotic normality of quadratic classifiers given by (1.2). We further discuss Bayes error rates for high-dimensional data.


3.1 Preliminary 

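Let

$$\delta_i = 2\left[ \frac{\mathrm{tr}\{(\Sigma_i A_i)^2\}}{n_i} + \frac{\mathrm{tr}(\Sigma_i A_{i'} \Sigma_{i'} A_{i'})}{n_{i'}} + \Delta_{iA} \right]^{1/2} \ \text{ for } i = 1, 2\ (i' \neq i).$$

Note that $\delta_i^2 = \mathrm{Var}[2(x_0 - \mu_i)^T \{A_i(\bar{x}_{in_i} - \mu_i) - A_{i'}(\bar{x}_{i'n_{i'}} - \mu_{i'} + (-1)^i \mu_{12})\}]$ for $i = 1, 2$ $(i' \neq i)$. Let $m = \min\{p, n_{\min}\}$. We assume the following conditions when $m \to \infty$ for $i = 1, 2$ $(i' \neq i)$:

$$\text{(C-iv)} \quad \frac{\mu_{12}^T A_{i'} \Sigma_{i'} A_{i'} \mu_{12} + \mathrm{tr}\{(\Sigma_{i'} A_{i'})^2\}/n_{i'}}{n_{i'} \delta_i^2} = o(1), \quad \frac{\mathrm{tr}\{(\Sigma_i A_i)^4\}}{n_i^2 \delta_i^4} = o(1) \ \text{ and } \ \frac{\mathrm{tr}\{(\Sigma_i A_{i'} \Sigma_{i'} A_{i'})^2\}}{n_{i'}^2 \delta_i^4} = o(1);$$

$$\text{(C-v)} \quad \frac{\mathrm{tr}[\{\Sigma_i(A_1 - A_2)\}^2]}{\delta_i^2} = o(1); \ \text{ and } \ \text{(C-vi)} \quad \frac{\Delta_{iA}}{\delta_i^2} = o(1).$$

From (A.6) in Appendix, under (A-i), (C-iv) and (C-v), it holds that

$$W_{i'}(A_{i'}) - W_i(A_i) - \Delta_i = 2(x_0 - \mu_i)^T \{A_i(\bar{x}_{in_i} - \mu_i) - A_{i'}(\bar{x}_{i'n_{i'}} - \mu_{i'} + (-1)^i \mu_{12})\} + o_P(\delta_i)$$

as $m \to \infty$ when $x_0 \in \pi_i$ for $i = 1, 2$ $(i' \neq i)$. Under (C-vi), it holds that $(x_0 - \mu_i)^T A_{i'} \mu_{12} = o_P(\delta_i)$ as $m \to \infty$ when $x_0 \in \pi_i$ for $i = 1, 2$ $(i' \neq i)$. Then, we claim the asymptotic normality of (1.2) under (A-i) as follows: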


Theorem 3.1. Assume (A-i). Assume also (C-iv) to (C-vi). Then, we have that

$$\frac{W_{i'}(A_{i'}) - W_i(A_i) - \Delta_i}{\delta_i} \Rightarrow N(0, 1) \ \text{ as } m \to \infty \tag{3.1}$$

when $x_0 \in \pi_i$ for $i = 1, 2$ $(i' \neq i)$, where “$\Rightarrow$” denotes the convergence in distribution and $N(0, 1)$ denotes a random variable distributed as the standard normal distribution. Furthermore, for the classification rule by (1.3) with (1.2), it holds that

$$e(i) = \Phi\left(\frac{-\Delta_i}{\delta_i}\right) + o(1) \ \text{ as } m \to \infty \ \text{ for } i = 1, 2. \tag{3.2}$$
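To make (3.2) concrete, the leading term $\Phi(-\Delta_i/\delta_i)$ can be evaluated for known parameters. The following is only a sketch, assuming NumPy; the function names are hypothetical, and arguments with the suffix `_o` refer to the other population $i'$.

```python
import math
import numpy as np

def std_normal_cdf(z):
    """Standard normal distribution function, written with math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def delta_scale(mu12, sigma_i, sigma_o, a_i, a_o, n_i, n_o):
    """delta_i from Section 3.1 for known covariance matrices and known A_i, A_{i'}."""
    m = sigma_i @ a_i
    t1 = np.trace(m @ m) / n_i                        # tr{(Sigma_i A_i)^2} / n_i
    t2 = np.trace(sigma_i @ a_o @ sigma_o @ a_o) / n_o
    t3 = mu12 @ a_o @ sigma_i @ a_o @ mu12            # Delta_{iA}
    return 2.0 * math.sqrt(t1 + t2 + t3)

def asymptotic_error(delta_i, scale_i):
    """Leading term of e(i) in (3.2); the o(1) remainder is ignored."""
    return std_normal_cdf(-delta_i / scale_i)

# Toy check with Sigma_1 = Sigma_2 = A_1 = A_2 = I_4 and n_1 = n_2 = 4 (assumed values).
mu12 = np.array([2.0, 0.0, 0.0, 0.0])
eye = np.eye(4)
scale = delta_scale(mu12, eye, eye, eye, eye, 4, 4)   # 2 * sqrt(1 + 1 + 4)
err = asymptotic_error(mu12 @ mu12, scale)
print(scale, err)
```

The sampling terms $t_1$ and $t_2$ inflate $\delta_i$ beyond $2\Delta_{iA}^{1/2}$, which is why the error in (3.2) can stay above the value obtained when the sample sizes are large.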


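Let $\delta_{\min} = \min\{\delta_1, \delta_2\}$. Now, instead of (C-iv) to (C-vi), we consider the following conditions when $m \to \infty$:

$$\text{(C-iv')} \quad \frac{\|\mu_{12}\|^2 \lambda_{\max} + \mathrm{tr}(\Sigma_{\max}^2)/n_{\min}}{n_{\min} \delta_{\min}^2} = o(1) \ \text{ and }$$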


A? 




= o(l), 


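$$\text{(C-v')} \quad \frac{\mathrm{tr}\{(A_1 - A_2)^2\} \lambda_{\max}^2}{\delta_{\min}^2} = o(1), \ \text{ and } \ \text{(C-vi')} \quad \frac{\|\mu_{12}\|^2 \lambda_{\max}}{\delta_{\min}^2} = o(1).$$

Proposition 3.1. Assume that $\limsup_{p \to \infty} \lambda_{\max}(A_i) < \infty$ for $i = 1, 2$. Then, (C-iv') and (C-vi') imply (C-iv) and (C-vi), respectively. Furthermore, if $\lambda(A_i) \in (0, \infty)$ as $p \to \infty$ for $i = 1, 2$, and $A_i$, $i = 1, 2$, are diagonal matrices such as in (I) to (III) in Section 1, (C-v') implies (C-v).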


Next, we consider the asymptotic normality of (1.2) under (A-ii). We assume the following condition instead of (C-vi) when $m \to \infty$ for $i = 1, 2$ $(i' \neq i)$:

$$\text{(C-vii)} \quad \frac{\sum_{j=1}^{q_i} (\gamma_{ij}^T A_{i'} \mu_{12})^4}{\delta_i^4} = o(1).$$

Note that $\sum_{j=1}^{q_i} (\gamma_{ij}^T A_{i'} \mu_{12})^2 = \Delta_{iA}$. Thus (C-vii) is milder than (C-vi).

Remark 3. The condition in (C-vii) can be written as a condition concerning eigenvalues and eigenvectors. If $\Gamma_i = H_i \Lambda_i^{1/2}$, $A_i = \Sigma_i^{-1}$, $i = 1, 2$, and $\Sigma_1 = \Sigma_2$, it holds that $\sum_{j=1}^{q_i} \{\gamma_{ij}^T A_{i'} \mu_{12}\}^4 = \sum_{j=1}^{p} \psi_j^2$ and $\Delta_{iA} = \mu_{12}^T \Sigma_i^{-1} \mu_{12} = \sum_{j=1}^{p} \psi_j$, where $\psi_j = (\mu_{12}^T h_{ij})^2/\lambda_{ij}$. Hence, the condition “$\sum_{j=1}^{p} \psi_j^2/(\sum_{j=1}^{p} \psi_j)^2 \to 0$ as $p \to \infty$” implies (C-vii).

Now, we claim the asymptotic normality of (1.2) under (A-ii) as follows:

Theorem 3.2. Assume (A-ii). Assume also (C-iv), (C-v) and (C-vii). Then, we have (3.1). Furthermore, for the classification rule by (1.3) with (1.2), we have (3.2).
3.2 Bayes error rates

When considering Theorem 3.2 under the situation that

$$\mathrm{tr}\{(\Sigma_i A_i)^2\}/n_i + \mathrm{tr}(\Sigma_i A_{i'} \Sigma_{i'} A_{i'})/n_{i'} = o(\Delta_{iA}) \ \text{ as } m \to \infty \tag{3.3}$$

for $i = 1, 2$ $(i' \neq i)$, one has (3.2) as

$$e(i) = \Phi\{-\Delta_i/(2\Delta_{iA}^{1/2})\} + o(1) \ \text{ as } m \to \infty \ \text{ for } i = 1, 2.$$

Note that $\delta_i/(2\Delta_{iA}^{1/2}) = 1 + o(1)$ under (3.3). If $\Sigma_1 = \Sigma_2$ $(= \Sigma)$, the ratio $\Delta_i/\Delta_{iA}^{1/2}$ has a maximum when $A_1 = A_2 = \Sigma^{-1}$. Then, the ratio becomes the Mahalanobis distance such as $\Delta_i/\Delta_{iA}^{1/2} = \Delta_{MD}^{1/2}$. The classification rule by (1.3) with (1.2) has an error rate converging to the Bayes error rate in the sense that $e(i) = \Phi(-\Delta_{MD}^{1/2}/2) + o(1)$ for $i = 1, 2$. On the other hand, if $\Sigma_1 \neq \Sigma_2$ and $\pi_i$s are Gaussian, under (C-iii) for (IV), the Bayes optimal classifier by (1.1) becomes as follows:

$$2(x_0 - \mu_i)^T \Sigma_{i'}^{-1} \mu_{12} + o_P(\Delta_{i(IV)}) > (-1)^i \Delta_{i(IV)}$$

when $x_0 \in \pi_i$ $(i' \neq i)$. Note that $\mathrm{Var}\{(x_0 - \mu_i)^T \Sigma_{i'}^{-1} \mu_{12}\} = \mu_{12}^T \Sigma_{i'}^{-1} \Sigma_i \Sigma_{i'}^{-1} \mu_{12}$ (hereafter called $\Delta_{iA(IV)}$) when $x_0 \in \pi_i$ $(i' \neq i)$ and $\Delta_{iA(IV)}$ is the same as $\Delta_{iA}$ for (IV). Hence, $(x_0 - \mu_i)^T \Sigma_{i'}^{-1} \mu_{12}/\Delta_{iA(IV)}^{1/2}$ is distributed as $N(0, 1)$ when $x_0 \in \pi_i : N_p(\mu_i, \Sigma_i)$. Then, the Bayes error rate becomes $e(i) = \Phi\{-\Delta_{i(IV)}/(2\Delta_{iA(IV)}^{1/2})\} + o(1)$ for $i = 1, 2$, under some conditions.

When considering Theorem 3.2 under the situation that

$$p/n_i + \mathrm{tr}(\Sigma_i \Sigma_{i'}^{-1})/n_{i'} = o(\Delta_{iA(IV)}) \ \text{ as } m \to \infty \tag{3.4}$$

for $i = 1, 2$ $(i' \neq i)$, one can claim that the classification rule by (1.3) with (2.7) has the Bayes error rate asymptotically even when $\pi_i$s are non-Gaussian. Note that (3.4) is equivalent to (3.3) for (IV) and (3.4) usually holds when $n_{\min} \to \infty$ while $p$ is fixed or $p \to \infty$ but $n_{\min}/p \to \infty$. If (3.4) is not met, the classifier by (2.7) is not optimal. We emphasize that (3.4) does not always hold for high-dimensional settings such as $n_{\min}/p \to 0$ or $n_{\min}/p \to c$ $(> 0)$. For example, let us consider the setup of Figure 1. The condition “$p/n_i = o(\Delta_{iA(IV)})$” is not met from the facts that $\Delta_{iA(IV)} = O(p^{2/3})$ and $n_1 = n_2/2 = o(p^{1/3})$, so that (3.4) does not hold. On the other hand, (C-iv) to (C-vi) hold, so that one can claim the asymptotic normality in Theorem 3.1. Note that (3.4) does not hold under (C-vi) for (IV). Thus the error rate of the classifier based on the Mahalanobis distance does not converge to the Bayes error rate when Theorem 3.1 is claimed. Such situations frequently occur in HDLSS settings such as $n_{\min}/p \to 0$. This is the reason why the classifier based on the Mahalanobis distance does not always give a preferable performance for high-dimensional data even when $n_{\min} \to \infty$, $\Sigma_i$s are known and $\pi_i$s are Gaussian.
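A small arithmetic check illustrates the gap described above. The sketch assumes a common covariance matrix $\Sigma_1 = \Sigma_2$, for which (IV) gives $\Delta_{i(IV)} = \Delta_{iA(IV)} = \Delta_{MD}$ and $\mathrm{tr}(\Sigma_i \Sigma_{i'}^{-1}) = p$, so that $\delta_{i(IV)} = 2(p/n_1 + p/n_2 + \Delta_{MD})^{1/2}$. The function names and the numerical values are illustrative assumptions, not figures from the paper.

```python
import math

def std_normal_cdf(z):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def mahalanobis_rule_error(delta_md, p, n1, n2):
    """Leading term of (3.2) for (IV) when Sigma_1 = Sigma_2 (known covariance, estimated means)."""
    scale = 2.0 * math.sqrt(p / n1 + p / n2 + delta_md)
    return std_normal_cdf(-delta_md / scale)

def bayes_error(delta_md):
    """Bayes error rate Phi(-Delta_MD^{1/2} / 2) for a common covariance matrix."""
    return std_normal_cdf(-math.sqrt(delta_md) / 2.0)

# Assumed toy values: p = 64, Delta_MD = p^{2/3} = 16, small samples (n1, n2) = (6, 12).
small_n = mahalanobis_rule_error(16.0, 64, 6, 12)
large_n = mahalanobis_rule_error(16.0, 64, 10**9, 10**9)
print(small_n, large_n, bayes_error(16.0))
```

With small samples the term $p/n_1 + p/n_2$ is comparable to $\Delta_{MD}$, so the error stays well above the Bayes error; it approaches the Bayes error only when $p/n_i$ becomes negligible relative to $\Delta_{MD}$, which is the content of (3.4).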

3.3 Asymptotic normality for (I) to (IV) 

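We consider $\delta_i$s for (I) to (IV). For (I), by putting $A_i = I_p$, $i = 1, 2$, one has $\delta_i$ $(i \neq i')$ as

$$\delta_i = 2\left\{ \frac{\mathrm{tr}(\Sigma_i^2)}{n_i} + \frac{\mathrm{tr}(\Sigma_1 \Sigma_2)}{n_{i'}} + \mu_{12}^T \Sigma_i \mu_{12} \right\}^{1/2} \quad \text{(hereafter called } \delta_{i(I)}\text{)}.$$

For (II), by putting $A_i = \{p/\mathrm{tr}(\Sigma_i)\} I_p$, $i = 1, 2$, it is given as

$$\delta_i = 2p\left[ \frac{\mathrm{tr}(\Sigma_i^2)}{n_i \mathrm{tr}(\Sigma_i)^2} + \frac{\mathrm{tr}(\Sigma_1 \Sigma_2)/n_{i'} + \mu_{12}^T \Sigma_i \mu_{12}}{\mathrm{tr}(\Sigma_{i'})^2} \right]^{1/2} \quad \text{(hereafter called } \delta_{i(II)}\text{)}.$$

For (III), by putting $A_i = \Sigma_{i(d)}^{-1}$, $i = 1, 2$, it is given as

$$\delta_i = 2\left[ \frac{\mathrm{tr}\{(\Sigma_i \Sigma_{i(d)}^{-1})^2\}}{n_i} + \frac{\mathrm{tr}(\Sigma_i \Sigma_{i'(d)}^{-1} \Sigma_{i'} \Sigma_{i'(d)}^{-1})}{n_{i'}} + \mu_{12}^T \Sigma_{i'(d)}^{-1} \Sigma_i \Sigma_{i'(d)}^{-1} \mu_{12} \right]^{1/2} \quad \text{(hereafter called } \delta_{i(III)}\text{)}.$$

For (IV), by putting $A_i = \Sigma_i^{-1}$, $i = 1, 2$, it is given as

$$\delta_i = 2\left\{ \frac{p}{n_i} + \frac{\mathrm{tr}(\Sigma_i \Sigma_{i'}^{-1})}{n_{i'}} + \mu_{12}^T \Sigma_{i'}^{-1} \Sigma_i \Sigma_{i'}^{-1} \mu_{12} \right\}^{1/2} \quad \text{(hereafter called } \delta_{i(IV)}\text{)}.$$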

From Theorems 3.1, 3.2 and Proposition 3.1, we have the following result for (I) to (III).

Corollary 3.1. Assume (C-iv'). Assume either (A-i) and (C-vi') or (A-ii) and (C-vii). Then, for the classification rule by (1.3) with (2.4), we have (3.2). Furthermore, under (C-v'), for the classification rule by (1.3) with (2.5) or (2.6), we have (3.2).

Remark 4. When $\mathrm{tr}(\Sigma_1)/\mathrm{tr}(\Sigma_2) \to 1$ as $p \to \infty$, it holds $\{\delta_{i(II)} \mathrm{tr}(\Sigma_{i'})/p\}/\delta_{i(I)} = 1 + o(1)$ $(i \neq i')$. Note that $\Delta_{i(II)} \mathrm{tr}(\Sigma_{i'})/p \ge \Delta_{(I)}$. It follows that $\Delta_{(I)}/\delta_{i(I)} \le \Delta_{i(II)}/\delta_{i(II)}$ for sufficiently large $p$ in (3.2).

From Theorems 3.1 and 3.2 and Proposition 3.1, we have the following result for (IV).

Corollary 3.2. Assume that (C-iv'), $\liminf_{p \to \infty} \lambda_{\min}(\Sigma_i) > 0$ and $\mathrm{tr}\{(I_p - \Sigma_i \Sigma_{i'}^{-1})^2\} = o(\delta_{\min(IV)}^2)$ for $i = 1, 2$ $(i' \neq i)$, where $\delta_{\min(IV)} = \min\{\delta_{1(IV)}, \delta_{2(IV)}\}$. Assume either (A-i) and (C-vi') or (A-ii) and (C-vii). Then, for the classification rule by (1.3) with (2.7), we have (3.2).

3.4 Comparisons of the classifiers 

In this section, we investigate the performance of the classifier in (1.2) for (I) to (IV) by using the asymptotic normality. When $\Sigma_1 = \Sigma_2$, we consider (I), (III) and (IV) in the setup of Figure 1. Note that (I), (III) and (IV) satisfy (C-iv) to (C-vi) from the facts that $n_{\min} = o(p^{1/3})$, $\Delta_{iA} = O(\|\mu_{12}\|^2) = O(p^{2/3})$, $\mathrm{tr}(\Sigma_i)/p \in (0, \infty)$ and $\mathrm{tr}(\Sigma_i^2) = o(p^2)$ as $p \to \infty$ for $i = 1, 2$. Thus, Theorem 3.1 holds for (I), (III) and (IV). We plotted the asymptotic error rates, $\Phi(-\Delta_{(I)}/\delta_{1(I)})$, $\Phi(-\Delta_{1(III)}/\delta_{1(III)})$ and $\Phi(-\Delta_{1(IV)}/\delta_{1(IV)})$ in Figure 3. From (3.2), we note that $e(1) - e(2) = o(1)$ when $\Sigma_1 = \Sigma_2$. Thus, the average error rate, $\bar{e} = \{\bar{e}(1) + \bar{e}(2)\}/2$, is regarded as an estimate of $e(1)$. We laid $\bar{e}$ for (I), (III) and (IV) by borrowing from Figure 1. We observed that $\bar{e}$ behaves very close to the asymptotic error rate as expected theoretically. We also plotted the Bayes error rate, $\Phi(-\Delta_{MD}^{1/2}/2)$. We observed that (IV) does not converge to the Bayes error rate when Theorem 3.1 is claimed. See Section 3.2 for the details. As for (I) and (III), the difference of the performances depends on the configuration of $\mu_{ij}$s and $\sigma_{i(j)}$s. When $p$ is sufficiently large, we note that $\Delta_{(I)} = \sum_{j=1}^{p} \mu_{12j}^2 < \Delta_{1(III)} = \sum_{j=1}^{p} \mu_{12j}^2/\sigma_{2(j)}$ for (a) and $\Delta_{(I)} > \Delta_{1(III)}$ for (b) because $\sigma_{2(j)} = 0.5 + j/(p+1)$, $j = 1, \ldots, p$, both for (a) and (b). It follows that $\Delta_{(I)}/\delta_{i(I)} < \Delta_{i(III)}/\delta_{i(III)}$ for (a) and $\Delta_{(I)}/\delta_{i(I)} > \Delta_{i(III)}/\delta_{i(III)}$ for (b). Thus (III) is better than (I) for (a); on the other hand, they trade places for (b).

Figure 3: The asymptotic error rates (dashed lines) by $\Phi(-\Delta_{(I)}/\delta_{1(I)})$, $\Phi(-\Delta_{1(III)}/\delta_{1(III)})$ and $\Phi(-\Delta_{1(IV)}/\delta_{1(IV)})$, together with the corresponding $\bar{e}$ (solid lines) by (2.4), (2.6) and (2.7) in the setup of Figure 1. The Bayes optimal error rate was given by $\Phi(-\Delta_{MD}^{1/2}/2)$.

When $\Sigma_1 \neq \Sigma_2$, (II), (III) and (IV) draw information about heteroscedasticity through the difference of $\mathrm{tr}(\Sigma_i)$s, $\Sigma_{i(d)}$s or $\Sigma_i$s, respectively. We consider them in the setup of Figure 2. For (c), note that $\Delta_{(I)} = 0$ but $\Delta_{i(II)} = \Delta_{i(III)} = \Delta_{i(IV)} \ge cp$ for some constant $c > 0$. (II), (III) and (IV) hold the consistency property even when $n_i$s are fixed because (C-i) to (C-iii) are satisfied. Actually, in Figure 2, we observed that the three classifiers gave preferable performances by using the difference of $\mathrm{tr}(\Sigma_i)$s, $\Sigma_{i(d)}$s or $\Sigma_i$s as $p$ increases. For (d), note that the difference of $\mathrm{tr}(\Sigma_i)$s is smaller than that for (c). Actually, in Figure 2, we observed that (II) gives a worse performance for (d) compared to (c). On the other hand, (III) gave a better performance compared to (II) because $\Delta_{i(III)}$ is sufficiently larger than $\Delta_{i(II)}$ for (d) when $p$ is large. (IV) draws information about heteroscedasticity from the difference of the covariance matrices themselves, so that it gave the best performance in this case. However, we note that it is quite difficult to estimate $\Sigma_i^{-1}$s feasibly for high-dimensional data. See Section 5.2 for the details.

4 Estimation of the quadratic classifier 

We denote an estimator of A % by Ai. We consider estimating the quadratic classifier by 
Wi(Ai). 

4.1 Preliminary 

Let $\|M\| = \lambda_{\max}^{1/2}(M^T M)$ for any square matrix $M$. Let $\kappa$ be a constant such as $\kappa = \Delta_{\min}$
or $\kappa = \delta_{\min}$. We consider the following condition for $\hat{A}_i$s under (*):

(C-viii) $p\|\hat{A}_i - A_i\| = o_P(\kappa)$ for $i = 1,2$.

Proposition 4.1. Assume (C-viii). Assume also that $\lambda(A_i) \in (0,\infty)$ as $p \to \infty$ for
$i = 1,2$. Then, we have that

$W_1(\hat{A}_1) - W_2(\hat{A}_2) = W_1(A_1) - W_2(A_2) + o_P(\kappa)$ (4.1)

under (*) when $x_0 \in \pi_i$ for $i = 1,2$.

When one chooses $A_i$s as $A_1 = A_2\ (= A)$, $W_i(A)$ gives a linear classifier. We consider
the following condition for $\hat{A}$ under (*):

(C-ix) $(p/n_{\min}^{1/2} + p^{1/2}\|\mu_{12}\|)\|\hat{A} - A\| = o_P(\kappa)$.

We have the following result.

Proposition 4.2. Assume (C-ix). Then, we have (4.1).

We note that (C-ix) is milder than (C-viii) from the fact that $\|\mu_{12}\| = O(p^{1/2})$. Hence,
we recommend using a linear classifier such as (2.4) or (4.5). The quadratic classifiers
should be used when the difference of covariance matrices is considerably large. See Section
4.3 for the details.


4.2 Quadratic classifier by $A_i = \{p/\mathrm{tr}(\Sigma_i)\}I_p$

We consider the classifier by

$W_i(\{p/\mathrm{tr}(S_{in_i})\}I_p) = \frac{p\|x_0 - \bar{x}_{in_i}\|^2}{\mathrm{tr}(S_{in_i})} - \frac{p}{n_i} + p\log\{\mathrm{tr}(S_{in_i})/p\}$. (4.2)

Note that $\delta_i = \delta_{i(II)}$, $\Delta_i = \Delta_{i(II)}$ and $A_i = \{p/\mathrm{tr}(\Sigma_i)\}I_p$. Here, $\lambda(A_i) \in (0,\infty)$ as $p \to \infty$
for $i = 1,2$, and (C-viii) naturally holds. From Corollary 2.1 and Proposition 4.1, we have
the following result.

Corollary 4.1. Assume (A-i). Assume also (C-i') and (C-ii'). Then, for the classification
rule by (1.3) with (4.2), we have (2.2).


The classifier by (4.2) is equivalent to the geometric classifier by Aoshima and Yata
(2011). Hereafter, we call the classifier by (4.2) the “geometrical quadratic discriminant
analysis (GQDA)”. Similar to Section 2.2, we have (2.2) for GQDA under (A-i) and (2.3)
even when $n_{\min}$ is fixed. If one can assume that $\liminf_{p\to\infty}|\mathrm{tr}(\Sigma_1)/\mathrm{tr}(\Sigma_2) - 1| > 0$, we
have (2.2) for GQDA under (A-i) and (2.8) even when $n_{\min}$ is fixed and $\mu_1 = \mu_2$. As for
the asymptotic normality, by combining Corollary 3.1 with Lemma B.3 given in Appendix
B, we have the following result.

Corollary 4.2. Assume (C-iv') and (C-v'). Assume either (A-i) and (C-vi') or (A-ii)
and (C-vii). Then, for the classification rule by (1.3) with (4.2), we have (3.2)
under $(\mathrm{tr}(\Sigma_1)/\mathrm{tr}(\Sigma_2) - 1)^2\,\mathrm{tr}(\Sigma_{\max}^2) = o(n_{\min}\delta_{\min(II)}^2)$ as $m \to \infty$, where
$\delta_{\min(II)} = \min\{\delta_{1(II)}, \delta_{2(II)}\}$.

Now, we compare DBDA with GQDA. We have that


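$\hat{\Delta}_{(I)} = \|\bar{x}_{1n_1} - \bar{x}_{2n_2}\|^2 - \mathrm{tr}(S_{1n_1})/n_1 - \mathrm{tr}(S_{2n_2})/n_2$ and

$\hat{\Delta}_{i(II)} = \frac{p}{\mathrm{tr}(S_{i'n_{i'}})}\left\{\hat{\Delta}_{(I)} + \mathrm{tr}(S_{in_i}) - \mathrm{tr}(S_{i'n_{i'}}) + \mathrm{tr}(S_{i'n_{i'}})\log\frac{\mathrm{tr}(S_{i'n_{i'}})}{\mathrm{tr}(S_{in_i})}\right\}$

for $i = 1,2$ ($i' \neq i$). We note that $E(\hat{\Delta}_{(I)}) = \Delta_{(I)}$. From (3.2) and Remark 4, if
$\hat{\Delta}_{i(II)}\mathrm{tr}(S_{i'n_{i'}})/p$ is sufficiently larger than $\hat{\Delta}_{(I)}$ for some $i$, we recommend using GQDA.
Otherwise one may use DBDA free from (A-i). See Corollary 2.1 for the details.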
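
As an illustration of the comparison above, the following is a minimal sketch of the two scores and of the estimators $\hat{\Delta}_{(I)}$ and $\hat{\Delta}_{i(II)}$. It is not the authors' code. It assumes that the DBDA score (2.4) is the $A_i = I_p$ case of $W_i(A_i)$, that is, $\|x_0 - \bar{x}_{in_i}\|^2 - \mathrm{tr}(S_{in_i})/n_i$, and that the rule (1.3) assigns $x_0$ to $\pi_1$ when $W_1 - W_2 < 0$. All function names are hypothetical.

```python
import numpy as np

def dbda_score(x0, X):
    """Assumed form of (2.4): ||x0 - xbar||^2 - tr(S)/n, i.e. W_i(A_i) with A_i = I_p."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    tr_S = X.var(axis=0, ddof=1).sum()  # tr(S) only needs the p sample variances
    return float(np.sum((x0 - xbar) ** 2) - tr_S / n)

def gqda_score(x0, X):
    """GQDA score (4.2): p||x0 - xbar||^2/tr(S) - p/n + p log{tr(S)/p}."""
    n, p = X.shape
    xbar = X.mean(axis=0)
    tr_S = X.var(axis=0, ddof=1).sum()
    return float(p * np.sum((x0 - xbar) ** 2) / tr_S - p / n + p * np.log(tr_S / p))

def classify(x0, X1, X2, score):
    """Assumed rule (1.3): class 1 if W_1 - W_2 < 0, class 2 otherwise."""
    return 1 if score(x0, X1) - score(x0, X2) < 0 else 2

def delta_hats(X1, X2):
    """Return (Delta_hat_(I), [Delta_hat_1(II), Delta_hat_2(II)]) as displayed in Section 4.2."""
    p = X1.shape[1]
    n = [X1.shape[0], X2.shape[0]]
    tr = [X.var(axis=0, ddof=1).sum() for X in (X1, X2)]
    d1 = np.sum((X1.mean(axis=0) - X2.mean(axis=0)) ** 2) - tr[0] / n[0] - tr[1] / n[1]
    d2 = []
    for i in (0, 1):
        j = 1 - i  # index of the other class, i' != i
        d2.append(p / tr[j] * (d1 + tr[i] - tr[j] + tr[j] * np.log(tr[j] / tr[i])))
    return float(d1), [float(v) for v in d2]
```

Both scores need only sample means and traces, so the cost is of order $np$ per class, which is in line with the low computational cost claimed for DBDA and GQDA.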


4.3 Quadratic classifier by $A_i = \Sigma_{i(d)}^{-1}$

Let $S_{in_i(d)} = \mathrm{diag}(s_{in_i(1)},...,s_{in_i(p)})$ for $i = 1,2$. We consider the classifier by

$W_i(S_{in_i(d)}^{-1}) = \sum_{j=1}^{p}\left\{\frac{(x_{0j} - \bar{x}_{ijn_i})^2}{s_{in_i(j)}} - \frac{1}{n_i} + \log s_{in_i(j)}\right\}$. (4.3)

Note that $\delta_i = \delta_{i(III)}$, $\Delta_i = \Delta_{i(III)}$ and $A_i = \Sigma_{i(d)}^{-1}$. Dudoit, Fridlyand and Speed (2002)
considered the quadratic classifier without the bias correction term. That was called the
diagonal quadratic discriminant analysis (DQDA). Hereafter, we call the classifier by (4.3)
“DQDA-bc”. Let $\eta_{i(j)} = \mathrm{Var}\{(x_{ijk} - \mu_{ij})^2\}$ for $i = 1,2$, and $j = 1,...,p$ ($k = 1,...,n_i$).
Since $\hat{A}_i = S_{in_i(d)}^{-1}$ does not satisfy (C-viii) in that shape, we consider the following
assumption:

(A-iii) $\eta_{i(j)} \in (0,\infty)$ as $p \to \infty$ and $\limsup_{p\to\infty} E\{\exp(t_{ij}|x_{ijk} - \mu_{ij}|^2/\eta_{i(j)}^{1/2})\} < \infty$ for some
$t_{ij} > 0$, $i = 1,2$, and $j = 1,...,p$ ($k = 1,...,n_i$).

Note that (A-iii) holds when $\pi_i$ has $N_p(\mu_i,\Sigma_i)$ for $i = 1,2$. From Corollary 2.1 and
Proposition 4.1, we have the following result.

Corollary 4.3. Assume (A-i) and (A-iii). Assume also (C-ii'). Then, for the classification
rule by (1.3) with (4.3), we have (2.2) under the condition that

$\frac{p^2\log p}{n_{\min}\Delta_{\min(III)}^2} = o(1)$. (4.4)


Note that (C-i') holds under (4.4). From the fact that $\Delta_{i(III)} = O(p)$, it follows that
$n_{\min}^{-1}\log p = o(1)$ under (4.4). Similar to Section 2.2, if one can assume that
$\liminf_{p\to\infty}\|\mu_{12}\|^2/p > 0$ or $\liminf_{p\to\infty}\sum_{j=1}^{p}|\sigma_{1(j)}/\sigma_{2(j)} - 1|/p > 0$, DQDA-bc holds (2.2) under
(A-i), (A-iii), (2.8) and $n_{\min}^{-1}\log p = o(1)$. When $\Delta_{\min(III)}$ is not sufficiently large, say
$\Delta_{\min(III)} = O(p^{1/2})$, we can claim Corollary 4.3 in high-dimension, large-sample-size settings
such as $n_{\min}/p \to \infty$. In Section 5, we shall provide a DQDA type classifier by
feature selection and show that it has the consistency property even when $n_{\min}/p \to 0$
and $\Delta_{\min(III)}$ is not sufficiently large.

Next, we consider the pooled sample diagonal matrix,


$S_{n(d)} = \frac{\sum_{i=1}^{2}(n_i - 1)S_{in_i(d)}}{\sum_{i=1}^{2}n_i - 2}$.

Note that $E(S_{n(d)}) = \sum_{i=1}^{2}(n_i - 1)\Sigma_{i(d)}/(\sum_{i=1}^{2}n_i - 2)$ (hereafter called $\Sigma_{(d)}$). When
$\Sigma_{1(d)} = \Sigma_{2(d)}$, it follows that $\Sigma_{(d)} = \Sigma_{i(d)}$, $i = 1,2$. Let us write $S_{n(d)} = \mathrm{diag}(s_{n(1)},...,s_{n(p)})$
and $\Sigma_{(d)} = \mathrm{diag}(\sigma_{(1)},...,\sigma_{(p)})$. We consider the classifier by

$W_i(S_{n(d)}^{-1}) = \sum_{j=1}^{p}\left\{\frac{(x_{0j} - \bar{x}_{ijn_i})^2}{s_{n(j)}} - \frac{s_{in_i(j)}}{n_i s_{n(j)}}\right\}$. (4.5)

We note that the classification rule by (1.3) with (4.5) becomes a linear classifier. Bickel and Levina
(2004) and Dudoit, Fridlyand and Speed (2002) considered the linear classifier without the
bias correction term. That was called the diagonal linear discriminant analysis (DLDA).
Hereafter, we call the classifier by (4.5) “DLDA-bc”. Although Huang, Tong and Zhao
(2010) gave bias corrected versions of DLDA and DQDA, they considered a bias correction
only when $\pi_i$s are Gaussian. We note that $\Delta_1 = \Delta_2 = \sum_{j=1}^{p}\mu_{12j}^2/\sigma_{(j)}$ (hereafter called $\Delta_{(III')}$)
and $A_1 = A_2 = \Sigma_{(d)}^{-1}$. Then, by combining Theorem 2.1 with Propositions 2.1 and 4.2,
we have the following result.

Corollary 4.4. Assume (A-iii). Assume also (C-i') and (C-ii'). Then, for the classification
rule by (1.3) with (4.5), we have (2.2) under the condition that

$\frac{p\log p}{n_{\min}\Delta_{(III')}} = o(1)$. (4.6)

Under $n_{\min}^{-1}\log p = o(1)$, one may claim that (4.6) is milder than (4.4) if $\Delta_{\min(III)}$ and
$\Delta_{(III')}$ are of the same order. Hence, we recommend using DQDA-bc when $\Delta_{\min(III)}$ is
considerably larger than $\Delta_{(III')}$. Otherwise one may use DLDA-bc even when $\Sigma_{i(d)}$s are
not common. We shall improve DQDA-bc by feature selection in Section 5.
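
To make the two diagonal classifiers of this section concrete, here is a minimal sketch of the scores (4.3) and (4.5), written directly from the formulas as read above. It is an illustration under that reading and not the authors' implementation; the function names are hypothetical.

```python
import numpy as np

def dqda_bc_score(x0, X):
    """DQDA-bc score (4.3) using the diagonal of the class sample covariance."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    s = X.var(axis=0, ddof=1)  # s_{in_i(j)}, j = 1,...,p
    return float(np.sum((x0 - xbar) ** 2 / s - 1.0 / n + np.log(s)))

def pooled_variances(X1, X2):
    """Diagonal of the pooled sample diagonal matrix S_{n(d)}."""
    n1, n2 = X1.shape[0], X2.shape[0]
    s1 = X1.var(axis=0, ddof=1)
    s2 = X2.var(axis=0, ddof=1)
    return ((n1 - 1) * s1 + (n2 - 1) * s2) / (n1 + n2 - 2)

def dlda_bc_score(x0, X, s_pooled):
    """DLDA-bc score (4.5): pooled variances plus a per-class bias correction."""
    n = X.shape[0]
    xbar = X.mean(axis=0)
    s = X.var(axis=0, ddof=1)
    return float(np.sum((x0 - xbar) ** 2 / s_pooled - s / (n * s_pooled)))
```

Because both classes share `s_pooled` in (4.5), the quadratic terms in $x_0$ cancel in $W_1 - W_2$, which is why DLDA-bc is a linear rule while DQDA-bc is not.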


4.4 Quadratic classifier by $A_i = \Sigma_i^{-1}$

In this section, we consider high-dimension, large-sample-size situations such as $n_{\min}/p \to \infty$
as $p \to \infty$ and discuss the classifier by

$W_i(S_{in_i}^{-1}) = (x_0 - \bar{x}_{in_i})^T S_{in_i}^{-1}(x_0 - \bar{x}_{in_i}) - p/n_i + \log|S_{in_i}|$. (4.7)

Note that $\delta_i = \delta_{i(IV)}$, $\Delta_i = \Delta_{i(IV)}$ and $A_i = \Sigma_i^{-1}$. Let $\eta_{i(rs)} = \mathrm{Var}\{(x_{irk} - \mu_{ir})(x_{isk} - \mu_{is})\}$
for $i = 1,2$, and $r,s = 1,...,p$ ($k = 1,...,n_i$). From Theorem 2.1 and Proposition 4.1, we
have the following result.

Corollary 4.5. Assume (A-i) and (A-iii). Assume also $\lambda(\Sigma_i) \in (0,\infty)$ as $p \to \infty$ and
$\liminf_{p\to\infty}\eta_{i(rs)} > 0$ for all $r,s$; $i = 1,2$. Then, for the classification rule by (1.3) with
(4.7), we have (2.2) under the conditions that $p^{1/2}/\Delta_{\min(IV)} = o(1)$ and

$\frac{p^4\log p}{n_{\min}\Delta_{\min(IV)}^2} = o(1)$. (4.8)

From the fact that $\Delta_{i(IV)} = O(p)$ when $\lambda(\Sigma_i) \in (0,\infty)$ as $p \to \infty$ for $i = 1,2$, it
follows that $n_{\min}^{-1}p^2\log p = o(1)$ under (4.8). Thus, the classification rule by (1.3) with
(4.7) can claim the consistency property when $n_{\min}^{-1}p^2\log p = o(1)$. However, the condition
“$n_{\min}^{-1}p^2\log p = o(1)$” is quite strict for high-dimensional data. In Section 5, we shall
discuss a classifier by sparse inverse covariance matrix estimation when $n_{\min}/p \to 0$.


5 Quadratic classifiers by feature selection and sparse inverse covariance matrix estimation

In this section, we propose a new quadratic classifier by feature selection for (4.3) and
discuss a quadratic classifier by sparse inverse covariance matrix estimation for (4.7).


5.1 Quadratic classifier after feature selection 


We consider applying a variable selection procedure to classification. Fan and Fan (2008)
proposed the feature annealed independent rule based on the difference of mean vectors.
However, we give a different type of feature selection by using both the differences of mean
vectors and covariance matrices. We have that

$\Delta_{1(III)} + \Delta_{2(III)} = \sum_{j=1}^{p}\left\{\frac{\mu_{12j}^2 + \sigma_{1(j)}}{\sigma_{2(j)}} + \frac{\mu_{12j}^2 + \sigma_{2(j)}}{\sigma_{1(j)}} - 2\right\}$.

Let $\theta_j = (\mu_{12j}^2 + \sigma_{1(j)})/(2\sigma_{2(j)}) + (\mu_{12j}^2 + \sigma_{2(j)})/(2\sigma_{1(j)}) - 1$ for $j = 1,...,p$. Note that
$\Delta_{1(III)} + \Delta_{2(III)} = 2\sum_{j=1}^{p}\theta_j$. Also, note that $\theta_j > 0$ when $\mu_{1j} \neq \mu_{2j}$ or $\sigma_{1(j)} \neq \sigma_{2(j)}$.
Now, we give an estimator of $\theta_j$ ($j = 1,...,p$) by

$\hat{\theta}_j = \frac{(\bar{x}_{1jn_1} - \bar{x}_{2jn_2})^2 + s_{1n_1(j)}}{2s_{2n_2(j)}} + \frac{(\bar{x}_{1jn_1} - \bar{x}_{2jn_2})^2 + s_{2n_2(j)}}{2s_{1n_1(j)}} - 1$.

Then, we have the following result.

Theorem 5.1. Assume (A-iii). Assume also $n_{\min}^{-1}\log p = o(1)$. Then, we have that as
$p \to \infty$

$\max_{j=1,...,p}|\hat{\theta}_j - \theta_j| = O_P\{(n_{\min}^{-1}\log p)^{1/2}\}$.

Let $D = \{j \mid \theta_j > 0 \text{ for } j = 1,...,p\}$ and $p_* = \#D$, where $\#S$ denotes the number of
elements in a set $S$. Let $\xi = (n_{\min}^{-1}\log p)^{1/2}$. We select a set of significant variables by

$\hat{D} = \{j \mid \hat{\theta}_j > \xi^{\gamma} \text{ for } j = 1,...,p\}$, (5.1)

where $\gamma \in (0,1)$ is a chosen constant. Then, from Theorem 5.1, we have the following
result.

Corollary 5.1. Assume (A-iii) and $n_{\min}^{-1}\log p = o(1)$. Assume also $\liminf_{p\to\infty}\theta_j > 0$ for
all $j \in D$. Then, we have that $P(\hat{D} = D) \to 1$ as $p \to \infty$.

Remark 5. As for $k$ $(\geq 3)$-class classification, one may consider $\hat{\theta}_j$ such as
$\hat{\theta}_j = \sum_{i \neq i'}^{k}\{(\bar{x}_{ijn_i} - \bar{x}_{i'jn_{i'}})^2 + s_{in_i(j)}\}/\{k(k-1)s_{i'n_{i'}(j)}\} - 1$ for $j = 1,...,p$.

Now, we consider a classifier using only the variables in $\hat{D}$. We define the classifier by


$W_i(S_{in_i(d)}^{-1})_{FS} = \sum_{j \in \hat{D}}\left\{\frac{(x_{0j} - \bar{x}_{ijn_i})^2}{s_{in_i(j)}} - \frac{1}{n_i} + \log s_{in_i(j)}\right\}$ (5.2)

for $i = 1,2$. We consider the classification rule by (1.3) with (5.2). We call this feature
selected DQDA “FS-DQDA”. Let us write that $x_{i*k} = (x_{ij_1k},...,x_{ij_{p_*}k})^T$ for all $i,k$, where
$D = \{j_1,...,j_{p_*}\}$. Let $\Sigma_{i*} = \mathrm{Var}(x_{i*k})$ for $i = 1,2$ ($k = 1,...,n_i$). Then, from Theorem 2.1
and Corollary 5.1, we have the following result.

Corollary 5.2. Assume (A-i) and (A-iii). Assume also $\lambda_{\max}(\Sigma_{i*}) = o(p_*)$ for $i = 1,2$,
and $\liminf_{p\to\infty}\theta_j > 0$ for all $j \in D$. Then, for the classification rule by (1.3) with (5.2),
we have (2.2) under $n_{\min}^{-1}\log p = o(1)$.

By comparing Corollary 5.2 with Corollary 4.3, note that the condition “$n_{\min}^{-1}\log p = o(1)$”
is much milder than (4.4). Thus we recommend FS-DQDA more than DQDA-bc (or
the original DQDA). For a choice of $\gamma \in (0,1)$ in (5.1), we recommend applying
cross-validation procedures or choosing a constant such as $\gamma = 0.5$ because Corollary 5.2 is
claimed for any $\gamma \in (0,1)$. In addition, we emphasize that the computational cost of
FS-DQDA is quite low even when $p \geq 10{,}000$.
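
The selection step (5.1) and the classifier (5.2) can be sketched as follows. This is a minimal illustration under the reading of $\hat{\theta}_j$, $\xi$ and (5.2) given in this section, with $\gamma = 0.5$ as the default and the rule (1.3) assumed to assign $x_0$ to $\pi_1$ when $W_1 - W_2 < 0$. The function names are hypothetical and this is not the authors' code.

```python
import numpy as np

def theta_hat(X1, X2):
    """Estimator of theta_j from the sample means and variances of each variable."""
    d2 = (X1.mean(axis=0) - X2.mean(axis=0)) ** 2
    s1 = X1.var(axis=0, ddof=1)
    s2 = X2.var(axis=0, ddof=1)
    return (d2 + s1) / (2 * s2) + (d2 + s2) / (2 * s1) - 1.0

def select_features(X1, X2, gamma=0.5):
    """Index set D_hat in (5.1): variables with theta_hat_j > xi**gamma."""
    n_min = min(X1.shape[0], X2.shape[0])
    p = X1.shape[1]
    xi = np.sqrt(np.log(p) / n_min)
    return np.flatnonzero(theta_hat(X1, X2) > xi ** gamma)

def fs_dqda_score(x0, X, selected):
    """FS-DQDA score (5.2): the DQDA-bc sum restricted to the selected variables."""
    n = X.shape[0]
    xbar = X[:, selected].mean(axis=0)
    s = X[:, selected].var(axis=0, ddof=1)
    return float(np.sum((x0[selected] - xbar) ** 2 / s - 1.0 / n + np.log(s)))

def fs_dqda_classify(x0, X1, X2, gamma=0.5):
    """Assumed rule (1.3); an empty selection gives W_1 - W_2 = 0 and hence class 2."""
    selected = select_features(X1, X2, gamma)
    w1 = fs_dqda_score(x0, X1, selected)
    w2 = fs_dqda_score(x0, X2, selected)
    return 1 if w1 - w2 < 0 else 2
```

Selection and scoring use only per-variable means and variances, so the whole procedure is linear in $p$, consistent with the low cost noted above.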


5.2 Quadratic classifier by sparse inverse covariance matrix estimation 


We consider applying a sparse estimation of inverse covariance matrices to classification.
Bickel and Levina (2008b) gave a sparse estimator of $\Sigma_i^{-1}$. Let $\sigma_{i(st)}$ be the $(s,t)$ element
of $\Sigma_i$ for $s,t = 1,...,p$ ($i = 1,2$). A sparsity measure of $\Sigma_i$ ($i = 1,2$) is given by
$c_{p,h_i} = \max_{1 \leq t \leq p}\sum_{s=1}^{p}|\sigma_{i(st)}|^{h_i}$ for $0 \leq h_i < 1$, where $0^0$ is defined to be 0. Note that
$\lambda_{\max}(\Sigma_i) \leq Mc_{p,h_i}$ for some constant $M > 0$. If $c_{p,h_i}$ is much smaller than $p$ for a constant $h_i \in [0,1)$,
$\Sigma_i$ is considered as sparse in the sense that many elements of $\Sigma_i$ are very small. See
Section 3 in Shao et al. (2011) for the details. Let $I(\cdot)$ be the indicator function. A
thresholding operator is defined by $T_{\tau}(M) = [m_{st}I(|m_{st}| \geq \tau)]$ for any $\tau > 0$ and any
symmetric matrix $M = [m_{st}]$. Let $\tau_{n_i} = M'(n_i^{-1}\log p)^{1/2}$ for some constant $M' > 0$.
Then, Bickel and Levina (2008b) gave the following result.


Theorem 5.2. Assume (A-iii), $n_i^{-1}\log p = o(1)$ and $\liminf_{p\to\infty}\lambda_{\min}(\Sigma_i) > 0$. For a
sufficiently large $M'$ $(> 0)$, it holds that as $p \to \infty$

$\|\{T_{\tau_{n_i}}(S_{in_i})\}^{-1} - \Sigma_i^{-1}\| = O_P\{c_{p,h_i}(n_i^{-1}\log p)^{(1-h_i)/2}\}$.

Remark 6. Theorem 5.2 is obtained by Theorem 1 and Section 2.3 in Bickel and Levina
(2008b).
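
The thresholding operator $T_{\tau}(M)$ and the resulting estimator of $\Sigma_i^{-1}$ can be sketched as below. This is a minimal illustration that assumes the convention $|m_{st}| \geq \tau$ for entries that are kept; the argument `M_const` stands in for the constant $M'$, whose choice the text does not specify, and the function names are hypothetical.

```python
import numpy as np

def hard_threshold(M, tau):
    """T_tau(M): keep entries with |m_st| >= tau and set the others to zero."""
    M = np.asarray(M, dtype=float)
    return np.where(np.abs(M) >= tau, M, 0.0)

def thresholded_precision(X, M_const=1.0):
    """Sketch of {T_tau(S)}^{-1} with tau = M_const * (log p / n)^{1/2}."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)
    tau = M_const * np.sqrt(np.log(p) / n)
    T = hard_threshold(S, tau)
    # The thresholded matrix is not guaranteed to be positive definite,
    # so the inversion can fail or be ill-conditioned for some inputs.
    return np.linalg.inv(T)
```

Forming and inverting a $p \times p$ matrix is the expensive part of this sketch, which gives a feel for why sparse estimation is considered costly in the discussion that follows.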


We use $\hat{A}_i = \{T_{\tau_{n_i}}(S_{in_i})\}^{-1}$ as an estimator of $\Sigma_i^{-1}$ and consider the classifier by
$W_i(\{T_{\tau_{n_i}}(S_{in_i})\}^{-1})$. By combining Theorem 5.2 and Proposition 4.1, if it holds that
$\lambda(\Sigma_i) \in (0,\infty)$ as $p \to \infty$ and

$\frac{p\,c_{p,h_i}(n_i^{-1}\log p)^{(1-h_i)/2}}{\Delta_{\min(IV)}} = o(1)$, (5.3)

the classification rule by (1.3) with $W_i(\{T_{\tau_{n_i}}(S_{in_i})\}^{-1})$ has (2.2) under some regularity conditions.
Cai, Liu and Luo (2011) gave
the constrained $\ell_1$-minimization for inverse matrix estimation (CLIME). One may apply
the CLIME to the classification rule by (1.3). However, one should note that the computational
cost for the sparse estimation of $\Sigma_i^{-1}$s is extremely high even when $p \approx 1{,}000$.
It is quite unrealistic to apply the estimation to classification when $p$ is very high as
$p \geq 10{,}000$. Also, the sparsity condition “$\lambda(\Sigma_i) \in (0,\infty)$ as $p \to \infty$” is quite severe
for high-dimensional data. In actual data analyses, we often encounter the situation that
$\lambda_{ij} \to \infty$ as $p \to \infty$ for the first several $j$s. See Yata and Aoshima (2013) for the details.


5.3 Simulation 

We used computer simulations to compare the performance of the classifiers: DBDA by
(2.4), GQDA by (4.2), DLDA-bc by (4.5), DQDA-bc by (4.3) and FS-DQDA by (5.2). We
did not compare the classifiers with the one given by sparse estimation of $\Sigma_i^{-1}$s such as
$W_i(\{T_{\tau_{n_i}}(S_{in_i})\}^{-1})$ in Section 5.2 because the computational cost of the sparse estimation
is very high when $p$ is large. Thus we considered the classifier by (2.7) instead of using
the sparse estimation, provided that $\Sigma_i$s were known. We set $\gamma = 0.5$ in (5.1). We
considered $p_* = \lceil p^{1/2}\rceil$. We generated $x_{ik} - \mu_i$, $k = 1,2,...$ ($i = 1,2$) independently
from (i) $N_p(0,\Sigma_i)$ or (ii) a $p$-variate $t$-distribution, $t_p(0,\Sigma_i,\nu)$ with mean zero, covariance
matrix $\Sigma_i$ and degrees of freedom $\nu$. We set $p = 2^s$, $s = 3,...,10$ for (i), and $p = 500$
and $\nu = 4s$, $s = 1,...,8$ for (ii). We set $\mu_1 = 0$, $\mu_2 = (0,...,0,1,...,1)^T$ whose last
$p_*$ elements are 1 and $\Sigma_1 = B_1(0.3^{|i-j|^{1/3}})B_1$, where $B_1$ is defined in Section 1. Let
$B_2 = \mathrm{diag}(1,...,1,2^{1/2},...,2^{1/2})$ whose last $p_*$ diagonal elements are $2^{1/2}$. We considered
four cases:

(a) $n_1 = 10$, $n_2 = 20$ and $\Sigma_2 = \Sigma_1$ for (i) $N_p(0,\Sigma_i)$;

(b) $n_1 = \lceil(\log p)^2\rceil$, $n_2 = 2n_1$ and $\Sigma_2 = \Sigma_1$ for (i) $N_p(0,\Sigma_i)$;

(c) $n_1 = \lceil(\log p)^2\rceil$, $n_2 = 2n_1$ and $\Sigma_2 = B_2\Sigma_1B_2$ for (i) $N_p(0,\Sigma_i)$;

and (d) $n_1 = \lceil(\log p)^2\rceil$, $n_2 = 2n_1$ and $\Sigma_2 = B_2\Sigma_1B_2$ for (ii) $t_p(0,\Sigma_i,\nu)$.

It holds that $n_{\min}^{-1}\log p = o(1)$ for (b), (c) and (d), $\liminf_{p\to\infty}\Delta_{\min}/p_* > 0$ for (a) to (d),
and $\liminf_{p\to\infty}|\mathrm{tr}(\Sigma_1) - \mathrm{tr}(\Sigma_2)|/p_* > 0$ for (c) and (d). Similar to Section 1, we calculated
the average error rate, $\bar{e}$, by 2000 replications and plotted the results in Figure 4 (a) to
(d).

We observed from (a) in Figure 4 that DBDA and GQDA give preferable performances
when $n_i$s are fixed. DLDA-bc gave a moderate performance because $\Sigma_1 = \Sigma_2$. However,
the other classifiers did not give preferable performances when $p$ is large. This is probably
due to the consistency property of those classifiers (except (2.7)) which is claimed
under at least $n_{\min}^{-1}\log p = o(1)$. Actually, as for (b), the other classifiers gave moderate
performances because $n_{\min}^{-1}\log p = o(1)$. Thus we do not recommend using quadratic
classifiers including all the elements (or the diagonal elements) of sample covariance matrices,
such as DQDA-bc and FS-DQDA, when the condition “$n_{\min}^{-1}\log p = o(1)$” is not
satisfied. When $n_{\min}^{-1}\log p \neq o(1)$ or $n_i$s are fixed, we recommend using DBDA and
GQDA. On the other hand, FS-DQDA gave a good performance for (c) as $p$ increases because
the difference of the covariance matrices becomes large as $p$ increases. We note that
from Corollary 5.2 FS-DQDA holds the consistency property for (c). However, DQDA-bc
did not give a preferable performance because $\Delta_{\min(III)} = O(p^{1/2})$, so that DQDA-bc
does not hold the consistency property from Corollary 4.3. We note that $\Sigma_1 \neq \Sigma_2$ but
$\Delta_{(I)}/\delta_{i(I)} \approx \Delta_{i(II)}/\delta_{i(II)}$ for (c). Thus GQDA gave a similar performance to DBDA for
(c). As for (d), DBDA gave a preferable performance even when $\nu$ is small because DBDA
holds the consistency property without (A-i). The other classifiers did not give preferable
performances when $\nu$ is small. However, these classifiers gave moderate performances
when $\nu$ becomes large because $t_p(0,\Sigma_i,\nu) \Rightarrow N_p(0,\Sigma_i)$ as $\nu \to \infty$. Especially, FS-DQDA
gave a good performance when $\nu$ is not small. This is probably because FS-DQDA has
smaller variance by feature selection, such as $p_*/p \to 0$, compared to the other classifiers.

Figure 4: The average error rates of the classifiers: A: DBDA, B: GQDA, C: DLDA-bc,
D: DQDA-bc, E: FS-DQDA, and F: the classifier by (2.7). Panel titles include
(c) $\Sigma_1 \neq \Sigma_2$ and (d) $\Sigma_1 \neq \Sigma_2$ (non-Gaussian case).

Throughout the simulations, the classifier by (2.7) did not give preferable performances
in spite of the fact that $\Sigma_i$s are known. See Section 3.2 for theoretical reasons. Therefore, it is likely
that the classifier by $W_i(\{T_{\tau_{n_i}}(S_{in_i})\}^{-1})$ gives poor performances for the high-dimensional
settings.


6 Example: Leukemia data sets 

We first analyzed gene expression data given by Golub et al. (1999) in which the data set
consists of 7129 $(= p)$ genes and 72 samples. We had 2 classes of leukemia subtypes, that
is, $\pi_1$: acute lymphoblastic leukemia (ALL) (47 samples) and $\pi_2$: acute myeloid leukemia
(AML) (25 samples). The data set consisted of two sets as 38 training samples (ALL:
27 samples and AML: 11 samples) and 34 test samples (ALL: 20 samples and AML: 14
samples). Note that $S_{1n_1(d)} = S_{2n_2(d)}$ if each sample has unit variance. Thus we did not
standardize each sample so as to have unit variance.

First, we checked several sparsity conditions. We standardized each sample by
$x_{ik}/\{\sum_{i=1}^{2}\mathrm{tr}(S_{in_i})/(2p)\}^{1/2}$ for all $i,k$, so that $\mathrm{tr}(S_{1n_1})/2 + \mathrm{tr}(S_{2n_2})/2 = p$. By using all the
samples (i.e., 72 samples), we calculated that

$\hat{\Delta}_{(I)} = 2060\ (= 0.289p)$, (6.1)

where $\hat{\Delta}_{(I)}$ is given in Section 4.2. Note that $E(\hat{\Delta}_{(I)}) = \|\mu_{12}\|^2$. From this observation,
we concluded that $\mu_{12}$ is non-sparse. Next, we considered an estimator of
$\|\Sigma_{12}\|_F^2 = \sum_{i=1}^{2}\mathrm{tr}(\Sigma_i^2) - 2\mathrm{tr}(\Sigma_1\Sigma_2)$ by $\hat{\Delta}_{\Sigma} = \sum_{i=1}^{2}W_{in_i} - 2\mathrm{tr}(S_{1n_1}S_{2n_2})$ having $W_{in_i}$s defined by
(16) in Aoshima and Yata (2014). Here, $W_{in_i}$ is an unbiased estimator of $\mathrm{tr}(\Sigma_i^2)$, so that
$E(\hat{\Delta}_{\Sigma}) = \|\Sigma_{12}\|_F^2$. We calculated that

$\hat{\Delta}_{\Sigma} = 9.77 \times 10^5\ (= 137p)$.

From this observation, we concluded that $\Sigma_{12}$ is non-sparse. Therefore, the Bayes error
rates of this data set are probably close to 0. Also, we calculated

$(\hat{\lambda}_{\max}(\Sigma_1), \hat{\lambda}_{\max}(\Sigma_2)) = (1223, 1457)\ (= (0.172p, 0.204p))$, (6.2)

where $\hat{\lambda}_{\max}(\Sigma_i)$ is an estimate of the largest eigenvalue due to the noise-reduction methodology
by Yata and Aoshima (2013). We concluded that “$\lambda(\Sigma_i) \in (0,\infty)$ as $p \to \infty$” does
not hold and $\Sigma_i$s are non-sparse because $\hat{\lambda}_{\max}(\Sigma_i)$s are very large. Therefore, we do not
recommend applying the classifier by sparse estimation of $\Sigma_i^{-1}$, such as $W_i(\{T_{\tau_{n_i}}(S_{in_i})\}^{-1})$.
Actually, we did not use any classifiers by sparse estimation of $\Sigma_i^{-1}$ in this section. Also,
note that the computational cost for the sparse estimation of $\Sigma_i^{-1}$ is very high when $p$ is
large.

We constructed the classifiers: DBDA, GQDA, DLDA-bc, DQDA-bc and FS-DQDA,
by using the training samples of sizes $n_1 = 27$ and $n_2 = 11$, and checked the accuracy by
using the test samples from each $\pi_i$. Throughout this section, we set $\gamma = 0.5$ in (5.1) for
FS-DQDA. We compared the classifiers with the hard-margin linear support vector machine
(HM-LSVM). See Vapnik (1999) for the details. Note that the data sets are linearly
separable by a hyperplane because $p > n_1 + n_2$. We emphasize that the computational cost
of DBDA, GQDA, DLDA-bc, DQDA-bc or FS-DQDA is as low as HM-LSVM even when
$p \geq 10{,}000$. We summarized misclassification rates in the first block of Table 1. We note
that $n_{\min} = 11$ and $n_{\min}^{-1}\log p = 0.81$, so that “$n_{\min}^{-1}\log p = o(1)$” does not hold. That is
probably the reason why DLDA-bc, DQDA-bc and FS-DQDA seem to lose the consistency
property. See Sections 4 and 5 for the details. On the other hand, DBDA and GQDA
gave reasonable performances even when $n_i$s are small and seem to hold the consistency
property. We calculated $\mathrm{tr}(S_{1n_1})/\mathrm{tr}(S_{2n_2}) = 0.989$ and $(\hat{\Delta}_{i(II)}\mathrm{tr}(S_{i'n_{i'}})/p)/\hat{\Delta}_{(I)} \approx 1$ for
$i \neq i'$. The difference of the trace of the covariance matrices is small and this is probably
the reason why DBDA gave a preferable performance. See Section 4.2 for the details. In
addition, HM-LSVM also gave a preferable performance. See Hall, Marron and Neeman
(2005) for the consistency property of HM-LSVM. For this data set, Cai and Liu (2011)
summarized misclassification rates for several other classifiers including a sparse linear
classifier called LPD. See Table 6 in Cai and Liu (2011) for the performances of the other
classifiers. Note that LPD has the Bayes error rates asymptotically under several sparsity
conditions. We observed that DBDA and GQDA gave the same accuracy as LPD. This is
probably because the sparsity conditions do not hold for this data set, so that the Bayes
error rates are almost 0. However, the computational cost for DBDA and GQDA is much
lower than LPD.

Table 1: Error rates of the classifiers for samples from Golub et al. (1999)

Classifier    DBDA   GQDA   DLDA-bc   DQDA-bc   FS-DQDA   HM-LSVM
Test samples (ALL: 20 and AML: 14)
Error rate    1/34   1/34   5/34      2/34      3/34      1/34
LOOCV of samples (ALL: 47 and AML: 25)
Error rate    3/72   6/72   11/72     1/72      0/72      2/72

Table 2: Estimates of $(\|\mu_{12}\|^2, \|\Sigma_{12}\|_F^2)$ by $(\hat{\Delta}_{(I)}, \hat{\Delta}_{\Sigma})$ for Armstrong et al. (2002).

Case                     (a) ALL and MLL                (b) ALL and AML               (c) MLL and AML
$\|\mu_{12}\|^2$         4076 (= 0.324p)                15050 (= 1.2p)                8546 (= 0.679p)
$\|\Sigma_{12}\|_F^2$    $1.12 \times 10^8$ (= 8863p)   $5.49 \times 10^6$ (= 436p)   $1.16 \times 10^8$ (= 9192p)

Next, by using all the samples (i.e., 72 samples), we checked the accuracy of the classifiers
by the leave-one-out cross-validation (LOOCV). We summarized misclassification
rates in the second block of Table 1. We note that $n_{\min} = 24$ and $n_{\min}^{-1}\log p = 0.37$ or
$n_{\min} = 25$ and $n_{\min}^{-1}\log p = 0.35$ in this case, so that $n_{\min}^{-1}\log p$ is a little small. We observed
that DQDA-bc and FS-DQDA give preferable performances. On the other hand, DLDA-bc
gave a poor performance because it does not draw information about heteroscedasticity.
For other classifiers, Tan et al. (2005) summarized results of the LOOCV for this data set.
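
The values of $n_{\min}^{-1}\log p$ quoted in this section can be reproduced with a few lines of arithmetic. The snippet below is only a check of those numbers for $p = 7129$ and $n_{\min} = 11$, 24 and 25; the helper name is hypothetical.

```python
import math

def log_p_over_n(p, n_min):
    """Return log(p) / n_min, the quantity required to be o(1) in Sections 4 and 5."""
    return math.log(p) / n_min

p = 7129  # number of genes in the Golub et al. (1999) data
for n_min in (11, 24, 25):
    print(n_min, round(log_p_over_n(p, n_min), 2))
```

With the training split ($n_{\min} = 11$) the ratio is more than twice as large as in the LOOCV setting, which matches the contrast drawn between the two blocks of Table 1.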

Finally, we analyzed gene expression data given by Armstrong et al. (2002) in which
the data set consists of 12582 $(= p)$ genes and 72 samples. We had 3 classes of leukemia
subtypes: acute lymphoblastic leukemia (ALL: 24 samples), mixed-lineage leukemia (MLL:
20 samples), and acute myeloid leukemia (AML: 28 samples). We considered three
cases: (a) ALL and MLL, (b) ALL and AML, and (c) MLL and AML. We standardized
each sample by $x_{ik}/\{\sum_{i=1}^{3}\mathrm{tr}(S_{in_i})/(3p)\}^{1/2}$ for all $i,k$, as before. Then, we calculated
$(\hat{\Delta}_{(I)}, \hat{\Delta}_{\Sigma})$ for the three cases. We summarized $(\hat{\Delta}_{(I)}, \hat{\Delta}_{\Sigma})$s in Table 2. From
Table 2, we concluded that $\mu_{12}$ and $\Sigma_{12}$ are non-sparse for (a) to (c). Also, by using
$\hat{\lambda}_{\max}(\Sigma_i)$, we estimated the largest eigenvalues as 1896, 3206 and 2101 for ALL, MLL
and AML, respectively. From this observation, we concluded that $\Sigma_i$s are non-sparse.
We estimated $\mathrm{tr}(\Sigma_{\max}^2)/(n_{\min}\Delta_{(I)}^2)$ and $\lambda_{\max}/\Delta_{(I)}$ by $C_1 = \max\{W_{1n_1}, W_{2n_2}\}/(n_{\min}\hat{\Delta}_{(I)}^2)$
and $C_2 = \max\{\hat{\lambda}_{\max}(\Sigma_1), \hat{\lambda}_{\max}(\Sigma_2)\}/\hat{\Delta}_{(I)}$ in (C-i') and (C-ii'). Then, we had $(C_1, C_2)$
as (0.362, 0.787) for (a), (0.001, 0.14) for (b), and (0.082, 0.375) for (c). Note that
$\liminf_{p\to\infty}\Delta_{\min(II)}/\Delta_{(I)} > 0$ and $\liminf_{p\to\infty}\Delta_{\min(III)}/\Delta_{(I)} > 0$. From these observations,
it is likely that the classifiers by (I) to (III) satisfy (C-i') and (C-ii') especially for
(b) and hold the consistency property in (2.2) from Proposition 2.1.

Based on all the samples, we checked the accuracy of the classifiers by using the
LOOCV for (a) to (c). We checked the accuracy for 3-class classification as well by
using the multiclass classification rule given in Remark 1. In the 3-class classification, we
used $\hat{\theta}_j$ given in Remark 5 for FS-DQDA and used the one-versus-one approach for
HM-LSVM. We summarized misclassification rates in Table 3. We observed that FS-DQDA
gives excellent performances. HM-LSVM also gave reasonable performances; however, it
does not draw information about the difference of the covariance matrices. See Section
2.2 in Aoshima and Yata (2014) for such an example. As for (b), all the classifiers gave
preferable performances. This is probably because the classifiers by (I) to (III) satisfy
(C-i') and (C-ii') for (b).

Table 3: Error rates of the classifiers for samples from Armstrong et al. (2002).

Classifier    DBDA   GQDA   DLDA-bc   DQDA-bc   FS-DQDA   HM-LSVM
LOOCV of samples from (a) ALL: 24 and MLL: 20
Error rate    1/44   2/44   6/44      1/44      0/44      0/44
LOOCV of samples from (b) ALL: 24 and AML: 28
Error rate    1/52   1/52   1/52      0/52      0/52      0/52
LOOCV of samples from (c) MLL: 20 and AML: 28
Error rate    4/48   4/48   1/48      3/48      3/48      3/48
LOOCV of samples from ALL: 24, MLL: 20 and AML: 28
Error rate    5/72   6/72   7/72      4/72      2/72      3/72


7 Concluding remarks 

In this paper, we considered high-dimensional quadratic classifiers in non-sparse settings.
The classifier based on the Mahalanobis distance does not always give a preferable performance
even when $n_{\min} \to \infty$ and $\pi_i$s are assumed Gaussian, having known covariance
matrices. See Sections 1 and 3. We emphasize that the quadratic classifiers proposed in
this paper draw information about heterogeneity effectively through both the differences
of mean vectors and covariance matrices. See Section 3.4 for the details. If the difference of
covariance matrices is not sufficiently large, we recommend using the linear classifiers, DBDA and DLDA-bc (or
the original DLDA). They are quite flexible about the conditions to claim the consistency
property. See Sections 4.2 and 4.3 for the details. We emphasize that DLDA-bc, DQDA-bc
and FS-DQDA can hold the consistency property under at least $n_{\min}^{-1}\log p = o(1)$. Thus we
do not recommend using these classifiers when $n_{\min}^{-1}\log p \neq o(1)$. In such cases, one should
use DBDA and GQDA because they hold the consistency property even when $n_i$s are fixed.
See Section 4.2 about the choice between DBDA and GQDA. When $n_{\min}^{-1}\log p = o(1)$, we
recommend DQDA-bc and FS-DQDA. Especially, FS-DQDA can claim the consistency
property even when $n_{\min}/p \to 0$ and $\Delta_{\min}$ is not sufficiently large. See Section 5.1 for
the details. For a choice of $\gamma \in (0,1)$ in (5.1), one may apply cross-validation procedures
or simply choose $\gamma = 0.5$. Actually, FS-DQDA with $\gamma = 0.5$ gave preferable performances
throughout our simulations and real data analyses. On the other hand, even
when $n_{\min}^{-1}\log p = o(1)$, we do not recommend using classifiers by the sparse estimation
of $\Sigma_i^{-1}$ unless (1) the eigenvalues are bounded in the sense that $\lambda(\Sigma_i) \in (0,\infty)$ as $p \to \infty$,
and (2) $\Sigma_i$s are sparse in the sense that many elements of $\Sigma_i$s are very small. We emphasize
that “$\lambda_{\max}(\Sigma_i)$s are bounded” is a strict condition since the eigenvalues should
depend on $p$ and it is probable that $\lambda_{ij} \to \infty$ as $p \to \infty$ for the first several $j$s. See
Yata and Aoshima (2013) for the details. Also, the computational cost of the classifiers
by the sparse estimation is terribly high.

In conclusion, we hope we have given simpler classifiers which will be more effective in 
the real world analysis of high-dimensional data. 


Appendix A 

We give proofs of the theorems. For proofs of the corollaries and the propositions, see 
Appendix B. 

Proof of Theorem 2.1. We consider the case when $x_0 \in \pi_1$. Under (C-i) and (C-ii), it
holds that for $i = 1,2$

$\mathrm{Var}\{(x_0 - \mu_1)^T A_i(\bar{x}_{in_i} - \mu_i)\} = \mathrm{tr}(\Sigma_1 A_i\Sigma_i A_i)/n_i = o(\Delta_1^2)$

and $\mathrm{Var}\{(x_0 - \mu_1 - \bar{x}_{2n_2} + \mu_2)^T A_2\mu_{12}\} = \mu_{12}^T A_2(\Sigma_1 + \Sigma_2/n_2)A_2\mu_{12} = o(\Delta_1^2)$ (A.1)

from the fact that

$\mu_{12}^T A_2\Sigma_2 A_2\mu_{12} \leq \mu_{12}^T A_2\mu_{12}\,\lambda_{\max}(A_2^{1/2}\Sigma_2 A_2^{1/2}) \leq \Delta_1\mathrm{tr}\{(\Sigma_2 A_2)^2\}^{1/2} = o(n_2\Delta_1^2)$

under (C-i). Note that $(\bar{x}_{in_i} - \mu_i)^T A_i(\bar{x}_{in_i} - \mu_i) - \mathrm{tr}(A_iS_{in_i})/n_i = \sum_{k \neq k'}(x_{ik} - \mu_i)^T A_i(x_{ik'} - \mu_i)/\{n_i(n_i - 1)\}$.
Then, under (C-i) it follows that for $i = 1,2$

$\mathrm{Var}\{(\bar{x}_{in_i} - \mu_i)^T A_i(\bar{x}_{in_i} - \mu_i) - \mathrm{tr}(A_iS_{in_i})/n_i\} = O[\mathrm{tr}\{(\Sigma_iA_i)^2\}/n_i^2] = o(\Delta_1^2)$. (A.2)

Then, by using Chebyshev’s inequality, from (A.1) and (A.2), we find that

$W_2(A_2) - W_1(A_1) = \mathrm{tr}[\{(x_0 - \mu_1)(x_0 - \mu_1)^T - \Sigma_1\}(A_2 - A_1)] + \Delta_1 + o_P(\Delta_1)$. (A.3)

Here, under (A-i) and (C-iii), it follows that

$\mathrm{Var}(\mathrm{tr}[\{(x_0 - \mu_1)(x_0 - \mu_1)^T - \Sigma_1\}(A_2 - A_1)]) = O(\mathrm{tr}[\{\Sigma_1(A_2 - A_1)\}^2]) = o(\Delta_1^2)$. (A.4)

Thus by combining (A.3) with (A.4), under (A-i) and (C-i) to (C-iii), we obtain that
$\{W_2(A_2) - W_1(A_1)\}/\Delta_1 = 1 + o_P(1)$, so that $P\{W_2(A_2) - W_1(A_1) > 0\} \to 1$. When
$x_0 \in \pi_2$, we have the same arguments. The proof is completed. □
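
The identity used before (A.2), which rewrites the bias-corrected quadratic form as an average over pairs $k \neq k'$, can be checked numerically. The sketch below is only such a check on arbitrary inputs; the function names are hypothetical.

```python
import numpy as np

def centered_quadratic_minus_trace(X, mu, A):
    """(xbar - mu)^T A (xbar - mu) - tr(A S)/n, with S the sample covariance matrix."""
    n = X.shape[0]
    d = X.mean(axis=0) - mu
    S = np.cov(X, rowvar=False)
    return float(d @ A @ d - np.trace(A @ S) / n)

def off_diagonal_average(X, mu, A):
    """Sum over k != k' of (x_k - mu)^T A (x_k' - mu), divided by n(n - 1)."""
    n = X.shape[0]
    Y = X - mu
    G = Y @ A @ Y.T  # G[k, k'] = (x_k - mu)^T A (x_k' - mu)
    return float((G.sum() - np.trace(G)) / (n * (n - 1)))
```

Since the right-hand side contains no terms with $k = k'$, its expectation is zero when $\mu$ is the true mean, which is the reason for subtracting $\mathrm{tr}(A_iS_{in_i})/n_i$ in $W_i(A_i)$.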

Proof of Theorem 3.1. Note that $\mathrm{tr}\{(\Sigma_iA_i)^2\}/n_i^2 = o(\delta_i^2)$, $i = 1,2$. Then, similar to (A.1)
to (A.4), under (A-i) and (C-iv) to (C-vi), we have that as $m \to \infty$

$W_{i'}(A_{i'}) - W_i(A_i) - \Delta_i = 2(x_0 - \mu_i)^T\{A_i(\bar{x}_{in_i} - \mu_i) - A_{i'}(\bar{x}_{i'n_{i'}} - \mu_{i'})\} + o_P(\delta_i)$ (A.5)

when $x_0 \in \pi_i$ ($i' \neq i$). Note that $2\omega_i/\delta_i \to 1$ as $m \to \infty$ for $i = 1,2$, under (C-vi),
where $\omega_i = \{\mathrm{tr}\{(\Sigma_iA_i)^2\}/n_i + \mathrm{tr}(\Sigma_iA_{i'}\Sigma_{i'}A_{i'})/n_{i'}\}^{1/2}$ ($i' \neq i$) in view of Lemma B.1 of
Appendix B. Then, by combining Lemma B.1 with (A.5), we conclude the results. □

Proof of Theorem 3.2. Similar to (A.5), under (A-i), (C-iv) and (C-v), we have that as
$m \to \infty$

$W_{i'}(A_{i'}) - W_i(A_i) - \Delta_i = 2(x_0 - \mu_i)^T\{A_i(\bar{x}_{in_i} - \mu_i) - A_{i'}(\bar{x}_{i'n_{i'}} - \mu_{i'} + (-1)^i\mu_{12})\} + o_P(\delta_i)$ (A.6)

when $x_0 \in \pi_i$ ($i' \neq i$). Then, by combining Lemma B.2 of Appendix B with (A.6), we
conclude the results. □

Proof of Theorem 5.1. By using (B.23) and (B.24) in Appendix B, we claim the result. □ 


Appendix B 

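Throughout, we consider the eigen-decomposition of $A_i$ by $A_i = H_{i(A)}\Lambda_{i(A)}H_{i(A)}^T$ for
$i = 1,2$, where $\Lambda_{i(A)} = \mathrm{diag}(\lambda_{i1(A)},...,\lambda_{ip(A)})$ having eigenvalues such as $\lambda_{i1(A)} \geq \cdots \geq \lambda_{ip(A)} > 0$
and $H_{i(A)} = [h_{i1(A)},...,h_{ip(A)}]$ is an orthogonal matrix of the corresponding
eigenvectors. Let $a_{i(j)}$ be the $j$-th diagonal element of $A_i$ for $j = 1,...,p$ ($i = 1,2$).
Let $\tilde{x}_{1k} = A_1^{1/2}(x_{1k} - \mu_1)$ and $\tilde{x}_{2k} = A_1^{-1/2}A_2(x_{2k} - \mu_2)$ for $k = 1,...,n_i$. Let
$\tilde{\Sigma}_1 = A_1^{1/2}\Sigma_1A_1^{1/2}$, $\tilde{\Sigma}_2 = A_1^{-1/2}A_2\Sigma_2A_2A_1^{-1/2}$, $\tilde{\Gamma}_1 = [\tilde{\gamma}_{11},...,\tilde{\gamma}_{1q_1}] = A_1^{1/2}\Gamma_1$ and
$\tilde{\Gamma}_2 = [\tilde{\gamma}_{21},...,\tilde{\gamma}_{2q_2}] = A_1^{-1/2}A_2\Gamma_2$. Note that $\mathrm{Var}(\tilde{x}_{ik}) = \tilde{\Gamma}_i\tilde{\Gamma}_i^T = \sum_{j=1}^{q_i}\tilde{\gamma}_{ij}\tilde{\gamma}_{ij}^T = \tilde{\Sigma}_i$, $i = 1,2$.
Let $B_i = \hat{A}_i - A_i$ for $i = 1,2$. Let $x_{oijk} = x_{ijk} - \mu_{ij}$ for $j = 1,...,p$ ($i = 1,2$; $k = 1,...,n_i$).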

Proof of Proposition 1.1. We can write that tr(A~ 1 A i /) = Yl 1 j=i^ l Jj(A) A i'hij(A) /\j(A)- 

Note that Y?j=i h Ij(A) A i' h ij(A) = tr(A^) and Y?j=i( h lj(A) A i' h ij{A) ~ \'j(A )) < 0 for any 
t £ {1, ...,p}. Then, by noting that A^u^n > • • • > A ip ( A ) > 0, we have that 


tr(A- 1 A i >) = 


A v 


i'l(A) 

*hlL4) 


h T 

n il(A) 


A i/hi 1(A) Aj/i(yi) h'ij(A) A i l h'ij{A) 


> 


X i'j{A ) 
“ X ij(A) 


E 


y-v ^ l ij(A) A i'^ l ij(A) 


+E 

1=2 


Kj(A) 


K'j(A) ^ijiA) A i'^ l ij{A) 


1=1 


A 


i 2 (A) 


+ E 

1=3 


Hj(A) 


P \ 


—' ^ l ij(A) A i'^ l ij(A) X i'j(A) 


A ., 


j =1 

Thus, when tr{Ej(A,;/ — A*)} = tr(A,^ 1 A,/) — p, it holds that 


X i'j(A) 
“ X ij (A) 


E 


p 

A. > E< A ‘ ' j(A)/Kj(A ) - 1 + log(A i:? (^ 4 )/A i / J ( j4 ))} > 0 
1=1 


(B.7) 


from the fact that $c - 1 + \log c^{-1} \ge 0$ for any positive constant $c$. Note that $\lambda_{1j(A)} \neq \lambda_{2j(A)}$ or $h_{ij(A)}^TA_{i'}h_{ij(A)} < \lambda_{i'j(A)}$ for some $j$ when $A_1 \neq A_2$. Since $c - 1 + \log c^{-1} > 0$ when $c \neq 1$, it holds that $\Delta_i > 0$ when $\lambda_{1j(A)} \neq \lambda_{2j(A)}$ for some $j$. From (B.7), if $h_{ij(A)}^TA_{i'}h_{ij(A)} < \lambda_{i'j(A)}$ for some $j$, it follows that $\mathrm{tr}(A_i^{-1}A_{i'}) > \sum_{j=1}^p(\lambda_{i'j(A)}/\lambda_{ij(A)})$, so that $\Delta_i > 0$. When $\mu_1 \neq \mu_2$, it holds that $\Delta_i \ge \mu_{12}^TA_{i'}\mu_{12} > 0$. Hence, we conclude the results. □
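The inequality $\mathrm{tr}(A_i^{-1}A_{i'}) \ge \sum_j \lambda_{i'j(A)}/\lambda_{ij(A)}$ rests on a summation-by-parts step: if every partial sum of a sequence $a_j$ is at most that of $b_j$, the totals agree, and the weights $1/\lambda_j$ are nondecreasing, then $\sum_j a_j/\lambda_j \ge \sum_j b_j/\lambda_j$. The following is a minimal numerical sketch of that step only, not of the full matrix statement. The names `a`, `b`, `lam` and both helper functions are illustrative assumptions; `a` stands in for $h_{ij(A)}^TA_{i'}h_{ij(A)}$, `b` for $\lambda_{i'j(A)}$ and `lam` for $\lambda_{ij(A)}$.

```python
import random


def weighted_sum_gap(a, b, lam):
    """Return sum_j a_j / lam_j - sum_j b_j / lam_j."""
    return sum((aj - bj) / lj for aj, bj, lj in zip(a, b, lam))


def random_instance(p, rng):
    """Build (a, b, lam) satisfying the partial-sum condition used in the proof."""
    # b and lam: positive and nonincreasing, like ordered eigenvalues.
    b = sorted((rng.uniform(0.1, 5.0) for _ in range(p)), reverse=True)
    lam = sorted((rng.uniform(0.1, 5.0) for _ in range(p)), reverse=True)
    # a: start from b and move mass from an earlier index to a later one.
    # Each move lowers some partial sums of a and keeps the total unchanged.
    a = b[:]
    for _ in range(3 * p):
        r, s = sorted(rng.sample(range(p), 2))
        eps = rng.uniform(0.0, 0.5 * a[r])
        a[r] -= eps
        a[s] += eps
    return a, b, lam


if __name__ == "__main__":
    rng = random.Random(0)
    gaps = [weighted_sum_gap(*random_instance(8, rng)) for _ in range(200)]
    print("smallest gap over 200 random instances:", min(gaps))
```

The gap is nonnegative for every instance built this way (up to rounding), which is what the chain of inequalities above asserts term by term.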

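Proof of Proposition 2.1. We note that

\[
\mu_{12}^TA_{i'}\Sigma_iA_{i'}\mu_{12} \le \mu_{12}^TA_{i'}\mu_{12}\,\lambda_{\max}(A_{i'}^{1/2}\Sigma_iA_{i'}^{1/2}) \le \Delta_i\lambda_{\max}(A_{i'}^{1/2}\Sigma_iA_{i'}^{1/2}) \tag{B.8}
\]

and $\mathrm{tr}(\Sigma_iA_{i'}\Sigma_{i'}A_{i'}) \le \mathrm{tr}\{(\Sigma_iA_{i'})^2\}^{1/2}\mathrm{tr}\{(\Sigma_{i'}A_{i'})^2\}^{1/2}$.

When $\limsup_{p\to\infty}\lambda_{i1(A)} < \infty$, $i = 1,2$, it holds that

\[
\begin{aligned}
&\lambda_{\max}(A_{i'}^{1/2}\Sigma_iA_{i'}^{1/2}) \le \lambda_{i1}\lambda_{\max}(A_{i'}) = \lambda_{i1}\lambda_{i'1(A)} = O(\lambda_{i1}) \quad\text{and} \\
&\mathrm{tr}\{(\Sigma_lA_{l'})^2\} \le \mathrm{tr}(\Sigma_lA_{l'}\Sigma_l)\lambda_{l'1(A)} \le \mathrm{tr}(\Sigma_l^2)\lambda_{l'1(A)}^2 = O\{\mathrm{tr}(\Sigma_l^2)\}
\end{aligned} \tag{B.9}
\]

for all $l, l'$. By combining (B.8) with (B.9), (C-i') and (C-ii') imply (C-i) and (C-ii).

Next, for (C-iii), it holds that $\mathrm{tr}[\{\Sigma_i(A_1 - A_2)\}^2] \le \lambda_{i1}\mathrm{tr}\{(A_1 - A_2)\Sigma_i(A_1 - A_2)\}$. When the $A_i$s are diagonal matrices such as $A_i = \mathrm{diag}(a_{i(1)},...,a_{i(p)})$, $i = 1,2$, it holds that $\Delta_i \ge \sum_{j=1}^p\{a_{i'(j)}/a_{i(j)} - 1 - \log(a_{i'(j)}/a_{i(j)})\}$ and $\mathrm{tr}\{(A_1 - A_2)\Sigma_i(A_1 - A_2)\} = \sum_{j=1}^p\sigma_{i(j)}(a_{1(j)} - a_{2(j)})^2$. Note that $a_{i(j)} \in (0,\infty)$ as $p \to \infty$ for all $i,j$, under $\lambda(A_i) \in (0,\infty)$ as $p \to \infty$ for $i = 1,2$. By Taylor expansion, we claim that

\[
a_{i'(j)}/a_{i(j)} - 1 - \log(a_{i'(j)}/a_{i(j)}) \ge (a_{1(j)} - a_{2(j)})^2/(2\max\{a_{1(j)}^2, a_{2(j)}^2\}).
\]

Then, it follows that $\sum_{j=1}^p\sigma_{i(j)}(a_{1(j)} - a_{2(j)})^2 = O(\Delta_i)$ because $\sigma_{i(j)} \in (0,\infty)$ as $p \to \infty$ for all $i,j$. Thus we have that $\mathrm{tr}[\{\Sigma_i(A_1 - A_2)\}^2] = O(\Delta_i\lambda_{i1})$. This concludes the results. □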
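The Taylor-expansion step used here (and again in the proof of Proposition 2.2) is the scalar bound $c - 1 - \log c \ge (c-1)^2/(2\max\{1, c^2\})$ for $c > 0$, applied with $c$ equal to a ratio such as $a_{i'(j)}/a_{i(j)}$. As a quick sanity check, the sketch below evaluates both sides on a logarithmic grid. The function names and the grid range are illustrative assumptions, and a grid check is of course not a proof.

```python
import math


def kl_term(c):
    """Left-hand side: c - 1 - log(c), nonnegative for c > 0."""
    return c - 1.0 - math.log(c)


def taylor_lower_bound(c):
    """Right-hand side: (c - 1)^2 / (2 * max(1, c^2))."""
    return (c - 1.0) ** 2 / (2.0 * max(1.0, c * c))


def check_grid(lo=1e-3, hi=1e3, steps=2000, tol=1e-12):
    """Return True if the bound holds at every point of a log-spaced grid."""
    ratio = (hi / lo) ** (1.0 / steps)
    c = lo
    for _ in range(steps + 1):
        if kl_term(c) < taylor_lower_bound(c) - tol:
            return False
        c *= ratio
    return True


if __name__ == "__main__":
    print("bound holds on grid:", check_grid())
```

Writing $c = a_{i'(j)}/a_{i(j)}$ and multiplying numerator and denominator by $a_{i(j)}^2$ gives the form with $(a_{1(j)} - a_{2(j)})^2$ in the numerator.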


Proofs of Corollaries 2.1 and 2.2. From Theorem 2.1 and Proposition 2.1, we can claim 
Corollaries 2.1 and 2.2 straightforwardly. □ 

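Proof of Proposition 2.2. We first consider the case when $\liminf_{p\to\infty}\sum_{j=1}^p|\lambda_{ij}/\lambda_{i'j} - 1|/p > 0$. When $c_{1j} < |\lambda_{ij}/\lambda_{i'j} - 1| < c_{2j}$ for some constants $c_{1j}$ ($> 0$) and $c_{2j}$ ($< \infty$), by Taylor expansion, it holds that

\[
\lambda_{ij}/\lambda_{i'j} - 1 - \log(\lambda_{ij}/\lambda_{i'j}) \ge \frac{(\lambda_{ij}/\lambda_{i'j} - 1)^2}{2\max\{1, \lambda_{ij}^2/\lambda_{i'j}^2\}} \ge \frac{c_{1j}|\lambda_{ij}/\lambda_{i'j} - 1|}{2(c_{2j} + 1)^2}.
\]

When $\lambda_{ij}/\lambda_{i'j} \to \infty$ as $p \to \infty$, it holds that for sufficiently large $p$

\[
\lambda_{ij}/\lambda_{i'j} - 1 - \log(\lambda_{ij}/\lambda_{i'j}) \ge |\lambda_{ij}/\lambda_{i'j} - 1|/2.
\]

Thus, when $\liminf_{p\to\infty}\sum_{j=1}^p|\lambda_{ij}/\lambda_{i'j} - 1|/p > 0$, it follows that $\liminf_{p\to\infty}\Delta_{i(IV)}/p \ge \liminf_{p\to\infty}\sum_{j=1}^p\{\lambda_{ij}/\lambda_{i'j} - 1 - \log(\lambda_{ij}/\lambda_{i'j})\}/p > 0$ from (B.7).

Next, we consider the case when $\liminf_{p\to\infty}|\mathrm{tr}(\Sigma_i\Sigma_{i'}^{-1})/p - 1| > 0$. We note that $\mathrm{tr}(\Sigma_i\Sigma_{i'}^{-1}) \ge \sum_{j=1}^p\lambda_{ij}/\lambda_{i'j}$ from (B.7). When $\mathrm{tr}(\Sigma_i\Sigma_{i'}^{-1})/(\sum_{j=1}^p\lambda_{ij}/\lambda_{i'j}) \to 1$ as $p \to \infty$, it holds that $\liminf_{p\to\infty}|\sum_{j=1}^p(\lambda_{ij}/\lambda_{i'j})/p - 1| > 0$ under $\liminf_{p\to\infty}|\mathrm{tr}(\Sigma_i\Sigma_{i'}^{-1})/p - 1| > 0$. It follows that $\liminf_{p\to\infty}\Delta_{i(IV)}/p > 0$ from the fact that $\sum_{j=1}^p|\lambda_{ij}/\lambda_{i'j} - 1|/p \ge |\sum_{j=1}^p(\lambda_{ij}/\lambda_{i'j})/p - 1|$. On the other hand, we note that

\[
\Delta_{i(IV)} \ge \mathrm{tr}(\Sigma_i\Sigma_{i'}^{-1}) - p - \sum_{j=1}^p\log(\lambda_{ij}/\lambda_{i'j}) \ge \mathrm{tr}(\Sigma_i\Sigma_{i'}^{-1}) - \sum_{j=1}^p\lambda_{ij}/\lambda_{i'j}
\]

because $\sum_{j=1}^p\{\lambda_{ij}/\lambda_{i'j} - 1 - \log(\lambda_{ij}/\lambda_{i'j})\} \ge 0$. Thus, when $\sum_{j=1}^p(\lambda_{ij}/\lambda_{i'j})/p - 1 \to 0$ as $p \to \infty$ and $\liminf_{p\to\infty}\{\mathrm{tr}(\Sigma_i\Sigma_{i'}^{-1})/(\sum_{j=1}^p\lambda_{ij}/\lambda_{i'j})\} > 1$, we have that $\liminf_{p\to\infty}\Delta_{i(IV)}/p > 0$. Hence, we conclude the results. □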

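Proof of Proposition 3.1. Under $\limsup_{p\to\infty}\lambda_{i1(A)} < \infty$ for $i = 1,2$, we have that $\mathrm{tr}\{(\Sigma_{i'}A_{i'})^2\} = O\{\mathrm{tr}(\Sigma_{i'}^2)\}$ and

\[
\begin{aligned}
\mathrm{tr}\{(\Sigma_iA_l\Sigma_lA_l)^2\} &= \mathrm{tr}\{(\Sigma_i^{1/2}A_l\Sigma_lA_l\Sigma_i^{1/2})^2\} \\
&\le \lambda_{\max}(\Sigma_i^{1/2}A_l\Sigma_lA_l\Sigma_i^{1/2})\,\mathrm{tr}(\Sigma_i^{1/2}A_l\Sigma_lA_l\Sigma_i^{1/2}) \\
&\le \lambda_{\max}(\Sigma_i^{1/2}A_l^2\Sigma_i^{1/2})\lambda_{l1}\delta_i^2n_l = O(\lambda_{i1}\lambda_{l1}\delta_i^2n_l); \text{ and} \\
\mu_{12}^TA_{i'}\Sigma_iA_{i'}\mu_{12} &\le \|\mu_{12}\|^2\lambda_{\max}(A_{i'}\Sigma_iA_{i'}) = O(\|\mu_{12}\|^2\lambda_{i1}) \quad\text{for } l = i, i'.
\end{aligned}
\]

Then, when $\limsup_{p\to\infty}\lambda_{i1(A)} < \infty$, $i = 1,2$, (C-iv') and (C-vi') imply (C-iv) and (C-vi), respectively. Similar to the proof of Proposition 2.1, we can claim the result for (C-v') from $\mathrm{tr}\{(A_1 - A_2)^2\} = \sum_{j=1}^p(a_{1(j)} - a_{2(j)})^2$ when the $A_i$s are diagonal matrices. The proof is completed. □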

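Lemma B.1. Let $\omega_i = \{\mathrm{tr}\{(\Sigma_iA_i)^2\}/n_i + \mathrm{tr}(\Sigma_iA_{i'}\Sigma_{i'}A_{i'})/n_{i'}\}^{1/2}$ for $i = 1,2$ ($i' \neq i$). Then, under (A-i), (C-iv) and (C-vi), we have that

\[
(x_0 - \mu_i)^T\{A_i(\bar{x}_{in_i} - \mu_i) - A_{i'}(\bar{x}_{i'n_{i'}} - \mu_{i'})\}/\omega_i \Rightarrow N(0,1) \quad\text{as } m \to \infty
\]

when $x_0 \in \pi_i$ for $i = 1,2$ ($i' \neq i$).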

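Proof of Lemma B.1. We consider the case when $i = 1$ ($i' = 2$) and $x_0 \in \pi_1$. Let $\tilde{x}_0 = A_1^{1/2}(x_0 - \mu_1)$. Then, it holds that $\mathrm{Var}(\tilde{x}_0 \mid x_0 \in \pi_1) = \mathrm{Var}(\tilde{x}_{1k}) = \tilde{\Sigma}_1$. Let

\[
v_k = \tilde{x}_0^T\tilde{x}_{1k}/(n_1\omega_1),\ k = 1,...,n_1, \quad\text{and}\quad v_{n_1+k} = -\tilde{x}_0^T\tilde{x}_{2k}/(n_2\omega_1),\ k = 1,...,n_2.
\]

Note that $\sum_{k=1}^{n_1+n_2}E(v_k^2) = 1$ and

\[
\sum_{k=1}^{n_1+n_2}v_k = (x_0 - \mu_1)^T\{A_1(\bar{x}_{1n_1} - \mu_1) - A_2(\bar{x}_{2n_2} - \mu_2)\}/\omega_1.
\]

Then, it holds that $E(v_k \mid v_{k-1},...,v_1) = 0$ for $k = 2,...,n_1 + n_2$. We consider applying the martingale central limit theorem given by McLeish (1974). Under (A-i), we can write that $\tilde{x}_{1l} = \tilde{\Gamma}_1y_{1l}$ and $\tilde{x}_{2l} = \tilde{\Gamma}_2y_{2l}$. Then, in a way similar to the equations (23) and (24) in Aoshima and Yata (2014), we can evaluate that under (A-i)

\[
(n_s\omega_1)^4E(v_k^4) = 3\mathrm{tr}(\tilde{\Sigma}_1\tilde{\Sigma}_s)^2 + O[\mathrm{tr}\{(\tilde{\Sigma}_1\tilde{\Sigma}_s)^2\}] \quad\text{and} \tag{B.10}
\]

\[
(n_sn_{s'})^2\omega_1^4E(v_k^2v_{k'}^2) = \mathrm{tr}(\tilde{\Sigma}_1\tilde{\Sigma}_s)\mathrm{tr}(\tilde{\Sigma}_1\tilde{\Sigma}_{s'}) + 2\mathrm{tr}(\tilde{\Sigma}_1\tilde{\Sigma}_s\tilde{\Sigma}_1\tilde{\Sigma}_{s'}) + O[\{\mathrm{tr}(\tilde{\Sigma}_1\tilde{\Sigma}_s\tilde{\Sigma}_1\tilde{\Sigma}_s)\mathrm{tr}(\tilde{\Sigma}_1\tilde{\Sigma}_{s'}\tilde{\Sigma}_1\tilde{\Sigma}_{s'})\}^{1/2}] \tag{B.11}
\]

for $k \neq k'$, where $s = 1$ for $k \in [1,...,n_1]$, $s = 2$ for $k \in [n_1 + 1,...,n_1 + n_2]$, $s' = 1$ for $k' \in [1,...,n_1]$ and $s' = 2$ for $k' \in [n_1 + 1,...,n_1 + n_2]$. Note that $\mathrm{tr}(\tilde{\Sigma}_1^4) \le \mathrm{tr}(\tilde{\Sigma}_1^2)^2$ and $\mathrm{tr}\{(\tilde{\Sigma}_1\tilde{\Sigma}_2)^2\} \le \mathrm{tr}(\tilde{\Sigma}_1\tilde{\Sigma}_2)^2$. Then, by using Chebyshev's inequality and Schwarz's inequality, from (B.10), under (A-i), we have that for Lindeberg's condition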

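\[
\sum_{k=1}^{n_1+n_2}E\{v_k^2I(v_k^2 \ge \tau)\} \le \sum_{k=1}^{n_1+n_2}\frac{E(v_k^4)}{\tau} = O\left\{\frac{\mathrm{tr}(\tilde{\Sigma}_1^2)^2}{n_1^3\omega_1^4} + \frac{\mathrm{tr}(\tilde{\Sigma}_1\tilde{\Sigma}_2)^2}{n_2^3\omega_1^4}\right\} \to 0
\]

as $m \to \infty$ for any $\tau > 0$, where $I(\cdot)$ is the indicator function. Since $2\omega_1/\delta_1 = 1 + o(1)$ under (C-vi), we note that

\[
\frac{\mathrm{tr}(\tilde{\Sigma}_1^4)}{n_1^2\omega_1^4} \to 0, \quad \frac{\mathrm{tr}\{(\tilde{\Sigma}_1\tilde{\Sigma}_2)^2\}}{n_2^2\omega_1^4} = \frac{\mathrm{tr}\{(\Sigma_1A_2\Sigma_2A_2)^2\}}{n_2^2\omega_1^4} \to 0, \quad\text{and}\quad \frac{\mathrm{tr}(\tilde{\Sigma}_1^3\tilde{\Sigma}_2)}{n_1n_2\omega_1^4} \le \frac{\mathrm{tr}(\tilde{\Sigma}_1^4)^{1/2}\mathrm{tr}\{(\tilde{\Sigma}_1\tilde{\Sigma}_2)^2\}^{1/2}}{n_1n_2\omega_1^4} \to 0
\]

under (C-iv). Then, by using Chebyshev's inequality, from (B.10) and (B.11), under (A-i), (C-iv) and (C-vi), we have that for any $\tau > 0$

\[
P\left(\left|\sum_{k=1}^{n_1+n_2}v_k^2 - 1\right| \ge \tau\right) = O\left[\frac{\mathrm{tr}(\tilde{\Sigma}_1^4)/n_1^2 + \mathrm{tr}(\tilde{\Sigma}_1^4)^{1/2}\mathrm{tr}\{(\tilde{\Sigma}_1\tilde{\Sigma}_2)^2\}^{1/2}/(n_1n_2) + \mathrm{tr}\{(\tilde{\Sigma}_1\tilde{\Sigma}_2)^2\}/n_2^2}{\omega_1^4}\right] + o(1) \to 0
\]

as $m \to \infty$, so that $\sum_{k=1}^{n_1+n_2}v_k^2 = 1 + o_P(1)$. Hence, by using the martingale central limit theorem, we obtain that $\sum_{k=1}^{n_1+n_2}v_k \Rightarrow N(0,1)$ as $m \to \infty$ under (A-i), (C-iv) and (C-vi).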

Hence, we conclude the result when i = 1. For the case when i = 2, we can have the same 
arguments. The proof is completed. □ 
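To make the statement of Lemma B.1 concrete, the following Monte Carlo sketch looks at the simplest special case one could pick, namely $A_1 = A_2 = I_p$, $\Sigma_1 = \Sigma_2 = I_p$, $\mu_1 = \mu_2 = 0$ and Gaussian data, for which $\omega_1 = (p/n_1 + p/n_2)^{1/2}$. These choices, the function names and the sample sizes are all illustrative assumptions and are not taken from the paper. The sketch only checks that the standardized statistic has mean near 0 and variance near 1.

```python
import math
import random


def simulate_statistic(p, n1, n2, rng):
    """One draw of x0^T (xbar_1 - xbar_2) / omega_1 under the toy model."""
    x0 = [rng.gauss(0.0, 1.0) for _ in range(p)]
    mean1 = [sum(rng.gauss(0.0, 1.0) for _ in range(n1)) / n1 for _ in range(p)]
    mean2 = [sum(rng.gauss(0.0, 1.0) for _ in range(n2)) / n2 for _ in range(p)]
    # omega_1^2 = tr{(Sigma_1 A_1)^2}/n1 + tr(Sigma_1 A_2 Sigma_2 A_2)/n2 = p/n1 + p/n2
    omega = math.sqrt(p / n1 + p / n2)
    return sum(x * (m1 - m2) for x, m1, m2 in zip(x0, mean1, mean2)) / omega


def run(reps=400, p=50, n1=5, n2=5, seed=0):
    """Return the empirical mean and variance of the statistic over `reps` draws."""
    rng = random.Random(seed)
    draws = [simulate_statistic(p, n1, n2, rng) for _ in range(reps)]
    mean = sum(draws) / reps
    var = sum((d - mean) ** 2 for d in draws) / (reps - 1)
    return mean, var


if __name__ == "__main__":
    mean, var = run()
    print("empirical mean:", round(mean, 3), "empirical variance:", round(var, 3))
```

In this toy model the variance of the statistic is exactly 1 for every $p$, $n_1$, $n_2$; the lemma additionally asserts asymptotic normality, which the sketch does not test.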


Lemma B.2. Under (A-ii), (C-iv) and (C-vii), we have that

\[
2(x_0 - \mu_i)^T\{A_i(\bar{x}_{in_i} - \mu_i) - A_{i'}(\bar{x}_{i'n_{i'}} - \mu_{i'} + (-1)^i\mu_{12})\}/\delta_i \Rightarrow N(0,1)
\]

as $m \to \infty$ when $x_0 \in \pi_i$ for $i = 1,2$ ($i' \neq i$).

Proof of Lemma B.2. We consider the case when $i = 1$ ($i' = 2$) and $x_0 \in \pi_1$. Let $x_0 - \mu_1 = \Gamma_1y_0$ and $y_0 = (y_{01},...,y_{0q_1})^T$. Under (A-ii), $y_{0s}$, $s = 1,...,q_1$, are independent. Let $\bar{\tilde{x}}_{ln_l} = \sum_{k=1}^{n_l}\tilde{x}_{lk}/n_l$, $l = 1,2$, $\tilde{\mu} = A_1^{-1/2}A_2\mu_{12}$ and

\[
w_s = 2y_{0s}\tilde{\gamma}_{1s}^T(\bar{\tilde{x}}_{1n_1} - \bar{\tilde{x}}_{2n_2} + \tilde{\mu})/\delta_1, \quad s = 1,...,q_1.
\]

Note that $q_1 \ge p$, $E(w_s) = 0$, $s = 1,...,q_1$, $\sum_{s=1}^{q_1}E(w_s^2) = 1$ and

\[
\sum_{s=1}^{q_1}w_s = 2(x_0 - \mu_1)^T\{A_1(\bar{x}_{1n_1} - \mu_1) - A_2(\bar{x}_{2n_2} - \mu_2 - \mu_{12})\}/\delta_1.
\]

Also, note that $E(w_s \mid w_{s-1},...,w_1) = 0$ for $s = 2,...,q_1$ under (A-ii). We consider applying the martingale central limit theorem. Let $M_{ls} = E(y_{lsk}^3)$ for all $l,s$. Note that $\limsup_{p\to\infty}|M_{ls}| < \infty$ for all $l,s$, under (A-ii) because $\limsup_{p\to\infty}E(y_{lsk}^4) < \infty$. Then, by using Schwarz's inequality and the arithmetic mean-geometric mean inequality, we can evaluate that under (A-ii)

= {1 + o(l)}7i’ s Ei7 la 7i’ t Si7 lt /n? + 0{(^ s tij lt /m) 2 }; and 
<n 

< { E (lls^lni) 4 } 1/2 {E{jitXi ni JitP-) 2 } 1/2 

= 0{7L^7 ls (7n^7nM) 1/2 |7nAlM} 

= 0[7i s iJi7 ls {7iiSi7 lt /n/ + (7 uA) 2 }M1 
= 0[{^ s tn ls /m} 2 + {7i t Si7 lt/ni} 2 + (7nA) 4 ], 1 = !> 2 


for all s,t. Then, we have that for all s,f 

<5i£(™ 4 ) = o\ S ^{il s %l ls /ni} 2 + (tL/*)^ 


Z=l 

„ 2 „„ 2 \ 


and 


(B.12) 


(tfi/2) 


4 E^WgWf ) T 


lis ( E ^ i/ n i + Am T ) 7 i, 7 it ( E ^zM + AA T ) 7 it 


z=i 


Z=i 


E (yo s yot) 

2 qi 

2 E(-!) ,+1 X]{(7fs7z«) 2 7n7z«7it + (o^J^TzuTfjMW™ 2 


z=i 


u=l 


+ O 


[ E 7uSi7i s 7f t E t 7 M /nf + O [ ^(tL^TuM) 2 
L z=i L 1=1 


Here, under (C-iv), we can evaluate that 
91 9; 


X] = Yrfu^ilulTJZiVMlu/n 2 


S,t= 1 1£=1 


lt=l 


= O 


= o 


U= 1 


<?Z 


-T vi 1 / 2 ! I ~T ■ 




u=l 


= O [{A T ^iA + tr(S 1 S ; )}tr{(S 1 S i ) 2 } 1/2 M 2 l = o(5f), 2 = 1,2 


(B.13) 


(B.14) 


from the fact that ELl(7fa^l7z J 2 < E ^i^ Sn; ,,') 2 = tr{(£i£,) 2 } = o^ifdf) 
under (C-iv). Then, by combining (B.12) and (B.13) with (B.14), under (A-ii), (C-iv) and (C-vii), for any $\tau > 0$, we have that as $m \to \infty$


E( w s) 

s =1 


o 


ELtr{(Si^) 2 }K + E^i(7LA) 


4, 





s=l 



< 


2^s,t=i 


EjiVsWt) - 1 
r 2 


r 

L S=1 


and 

+ o(l) —y 0, 


so that $\sum_{s=1}^{q_1}E\{w_s^2I(w_s^2 \ge \tau)\} \le \sum_{s=1}^{q_1}E(w_s^4)/\tau \to 0$ and $\sum_{s=1}^{q_1}w_s^2 = 1 + o_P(1)$. Hence, by using the martingale central limit theorem, we obtain that $\sum_{s=1}^{q_1}w_s \Rightarrow N(0,1)$ as $m \to \infty$


under (A-ii), (C-iv) and (C-vii). We conclude the result when i = 1. For the case when 
i = 2, we can have the same arguments. The proof is completed. □ 

Proofs of Corollaries 3.1 and 3.2. From Theorems 3.1 and 3.2 and Proposition 3.1, we 
can claim Corollaries 3.1 and 3.2 straightforwardly. □ 

Lemma B.3. Assume that when $x_0 \in \pi_i$ for $i = 1,2$

\[
\mathrm{tr}[\{(x_0 - \mu_i)(x_0 - \mu_i)^T - \Sigma_i\}(B_1 - B_2)] = o_P(\kappa); \tag{B.15}
\]

\[
\mathrm{tr}\{\Sigma_i(B_1 - B_2)\} - \log|\hat{A}_1A_1^{-1}| + \log|\hat{A}_2A_2^{-1}| = o_P(\kappa); \text{ and} \tag{B.16}
\]

\[
\{2(x_0 - \mu_i) + (-1)^{i+1}\mu_{12}\}^TB_{i'}\mu_{12} = o_P(\kappa) \quad (i' \neq i) \tag{B.17}
\]

and $(p/n_l^{1/2})\|B_l\| = o_P(\kappa)$, $l = 1,2$, where $\kappa = \Delta_{\min}$ or $\kappa = \delta_{\min}$. Then, (4.1) holds.

Proof of Lemma B.3. We consider the case when $x_0 \in \pi_1$. We have that

\[
\begin{aligned}
&W_1(\hat{A}_1) - W_1(A_1) - W_2(\hat{A}_2) + W_2(A_2) \\
&= \mathrm{tr}[\{(x_0 - \mu_1)(x_0 - \mu_1)^T - \Sigma_1\}(B_1 - B_2)] + \mathrm{tr}\{\Sigma_1(B_1 - B_2)\} - \log|\hat{A}_1A_1^{-1}| + \log|\hat{A}_2A_2^{-1}| \\
&\quad + \sum_{l=1}^2(-1)^{l+1}\mathrm{tr}[\{2(x_0 - \mu_1 - (\bar{x}_{ln_l} - \mu_1)/2)(\mu_1 - \bar{x}_{ln_l})^T - S_{ln_l}/n_l\}B_l].
\end{aligned}
\]

Note that $\mathrm{tr}(S_{ln_l}) = O_P(p)$, $\|\bar{x}_{ln_l} - \mu_1\|^2 \le \|\bar{x}_{ln_l} - \mu_l\|^2 + \|\mu_l - \mu_1\|^2 = \|\mu_l - \mu_1\|^2 + O_P(p/n_l)$ and $\|x_0 - \mu_1 - (\bar{x}_{ln_l} - \mu_1)/2\|^2 \le \|x_0 - \mu_1\|^2 + \|\bar{x}_{ln_l} - \mu_l\|^2 + \|\mu_l - \mu_1\|^2 = O_P(p)$, $l = 1,2$, from the facts that $E(\|x_0 - \mu_1\|^2) = \mathrm{tr}(\Sigma_1)$, $E\{\mathrm{tr}(S_{ln_l})\} = \mathrm{tr}(\Sigma_l)$, $E(\|\bar{x}_{ln_l} - \mu_l\|^2) = \mathrm{tr}(\Sigma_l)/n_l$, $\mathrm{tr}(\Sigma_l) = O(p)$, $l = 1,2$, and $\|\mu_{12}\|^2 = O(p)$. Then, we have that for $l = 1,2$

\[
\begin{aligned}
&|\mathrm{tr}[\{2(x_0 - \mu_1 - (\bar{x}_{ln_l} - \mu_l)/2)(\mu_l - \bar{x}_{ln_l})^T - S_{ln_l}/n_l\}B_l]| \\
&\le 2\|x_0 - \mu_1 - (\bar{x}_{ln_l} - \mu_l)/2\| \cdot \|\bar{x}_{ln_l} - \mu_l\| \cdot \|B_l\| + \mathrm{tr}(S_{ln_l})\|B_l\|/n_l \\
&= O_P\{(p/n_l^{1/2})\|B_l\|\}.
\end{aligned}
\]

Also, we have that $|(\bar{x}_{2n_2} - \mu_2)^TB_2\mu_{12}| = O_P\{(p/n_2^{1/2})\|B_2\|\}$. Thus it holds that

\[
\begin{aligned}
&\sum_{l=1}^2(-1)^{l+1}\mathrm{tr}[\{2(x_0 - \mu_1 - (\bar{x}_{ln_l} - \mu_1)/2)(\mu_1 - \bar{x}_{ln_l})^T - S_{ln_l}/n_l\}B_l] \\
&= -\{2(x_0 - \mu_1) + \mu_{12}\}^TB_2\mu_{12} + O_P\{(p/n_1^{1/2})\|B_1\| + (p/n_2^{1/2})\|B_2\|\}.
\end{aligned}
\]

Hence, we conclude the result when $x_0 \in \pi_1$. For the case when $x_0 \in \pi_2$, we can have the same arguments. The proof is completed. □


Proofs of Propositions 4.1 and 4.2. We consider the case when $x_0 \in \pi_1$. Similar to the proof of Lemma B.3, we can claim that $|\{2(x_0 - \mu_1) + \mu_{12}\}^TB_2\mu_{12}| \le \|2(x_0 - \mu_1) + \mu_{12}\| \cdot \|\mu_{12}\| \cdot \|B_2\| = O_P(p^{1/2}\|\mu_{12}\| \cdot \|B_2\|) = O_P(p\|B_2\|)$ because $\|\mu_{12}\|^2 = O(p)$ and $\|2(x_0 - \mu_1) + \mu_{12}\|^2 = O_P(p)$. Thus, (B.17) holds under (C-viii) or (C-ix). Note that (B.15) and (B.16) naturally hold when $A_1 = A_2$ and $\hat{A}_1 = \hat{A}_2$. Hence, from Lemma B.3, we conclude the result of Proposition 4.2 when $x_0 \in \pi_1$.

Next, we consider (B.15) and the first term of (B.16). We have that for $l = 1,2$

\[
\begin{aligned}
&|\mathrm{tr}(\Sigma_1B_l)| \le \mathrm{tr}(\Sigma_1)\|B_l\| = O_P(p\|B_l\|) \quad\text{and} \\
&|\mathrm{tr}[\{(x_0 - \mu_1)(x_0 - \mu_1)^T - \Sigma_1\}B_l]| \le \|x_0 - \mu_1\|^2\|B_l\| + \mathrm{tr}(\Sigma_1)\|B_l\| = O_P(p\|B_l\|).
\end{aligned}
\]

Finally, we consider $\log|\hat{A}_lA_l^{-1}|$, $l = 1,2$, in (B.16). Let $e_p$ be an arbitrary (random) $p$-vector such that $\|e_p\| = 1$. Note that $\|e_p^TA_l^{-1/2}\| \in (0,\infty)$ as $p \to \infty$ under $\lambda(A_l) \in (0,\infty)$ as $p \to \infty$. Thus we have that

\[
e_p^TA_l^{-1/2}B_lA_l^{-1/2}e_p = e_p^TA_l^{-1/2}\hat{A}_lA_l^{-1/2}e_p - 1 = O_P(\|B_l\|),
\]

so that $\lambda_{\min}(A_l^{-1/2}\hat{A}_lA_l^{-1/2}) - 1 = O_P(\|B_l\|)$ and $\lambda_{\max}(A_l^{-1/2}\hat{A}_lA_l^{-1/2}) - 1 = O_P(\|B_l\|)$. Hence, under $\|B_l\| = o_P(1)$, it holds that for $l = 1,2$

\[
\log|\hat{A}_lA_l^{-1}| = \log|A_l^{-1/2}\hat{A}_lA_l^{-1/2}| = O_P(p\|B_l\|).
\]

Note that $\Delta_{\min} = O(p)$ and $\delta_{\min} = O(p)$ under $\lambda(A_i) \in (0,\infty)$ as $p \to \infty$ for $i = 1,2$. Then, under (C-viii), it holds that $\|B_l\| = o_P(1)$ for $l = 1,2$. Hence, (C-viii) implies (B.15) and (B.16). This gives the result of Proposition 4.1 when $x_0 \in \pi_1$. For the case when $x_0 \in \pi_2$, we can have the same arguments. The proof is completed. □


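Proof of Corollary 4.1. Under (A-i) we have that $\mathrm{Var}\{\mathrm{tr}(S_{in_i})\} = O(\mathrm{tr}(\Sigma_i^2)/n_i)$, $i = 1,2$, so that $\mathrm{tr}(S_{in_i}) = \mathrm{tr}(\Sigma_i) + O_P\{(\mathrm{tr}(\Sigma_i^2)/n_i)^{1/2}\}$. Then, under (C-i') it holds that $\mathrm{tr}(S_{in_i}) = \mathrm{tr}(\Sigma_i) + o_P(\Delta_{\min(II)}) = \mathrm{tr}(\Sigma_i)\{1 + o_P(1)\}$ and $\mathrm{tr}(\Sigma_i^2)/(n_ip^2) = o(\Delta_{\min(II)}^2/p^2) = o(1)$ for $i = 1,2$ because $\Delta_{\min(II)} = O(p)$. Thus, we have that under (A-i) and (C-i')

\[
\|B_i\| = \|\{p/\mathrm{tr}(S_{in_i}) - p/\mathrm{tr}(\Sigma_i)\}I_p\| = \frac{p|\mathrm{tr}(S_{in_i}) - \mathrm{tr}(\Sigma_i)|}{\mathrm{tr}(S_{in_i})\mathrm{tr}(\Sigma_i)} = O_P\{(\mathrm{tr}(\Sigma_i^2)/n_i)^{1/2}/\mathrm{tr}(S_{in_i})\} = o_P\{\Delta_{\min(II)}/p\} = o_P(1), \tag{B.18}
\]

so that $p\|B_i\| = o_P(\Delta_{\min(II)})$. Note that $\lambda_{\max}(A_i) = \lambda_{\min}(A_i) = p/\mathrm{tr}(\Sigma_i) \in (0,\infty)$ as $p \to \infty$. Thus, from Corollary 2.1 and Proposition 4.1, we conclude the result. □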
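The key quantity in (B.18) is $\|B_i\| = |p/\mathrm{tr}(S_{in_i}) - p/\mathrm{tr}(\Sigma_i)|$, which is small as soon as $\mathrm{tr}(S_{in_i})/\mathrm{tr}(\Sigma_i)$ concentrates around 1. A small simulation can illustrate this even when $n_i$ is much smaller than $p$. The sketch below assumes $\Sigma_i = I_p$ and Gaussian data (so $\mathrm{tr}(\Sigma_i) = p$); the function names and the values of $p$ and $n$ are hypothetical choices made only for illustration.

```python
import random


def trace_of_sample_cov(p, n, rng):
    """tr(S) for n i.i.d. N(0, I_p) observations; only the diagonal of S is needed."""
    total = 0.0
    for _ in range(p):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(xs) / n
        total += sum((x - mean) ** 2 for x in xs) / (n - 1)
    return total


def norm_of_b(p, n, rng):
    """||B|| = |p / tr(S) - p / tr(Sigma)| with tr(Sigma) = p in this toy model."""
    return abs(p / trace_of_sample_cov(p, n, rng) - 1.0)


if __name__ == "__main__":
    rng = random.Random(0)
    for p in (50, 200, 800):
        worst = max(norm_of_b(p, 10, rng) for _ in range(20))
        print("p =", p, "largest ||B|| over 20 replications:", round(worst, 4))
```

With $n$ fixed at 10, the largest observed $\|B\|$ should shrink roughly like $(np)^{-1/2}$ as $p$ grows, in line with the rate $(\mathrm{tr}(\Sigma_i^2)/n_i)^{1/2}/\mathrm{tr}(\Sigma_i)$ appearing in (B.18).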
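Proof of Corollary 4.2. We consider the case when $x_0 \in \pi_i$. Note that $\mathrm{tr}(S_{ln_l})/\mathrm{tr}(\Sigma_l) = 1 + O_P\{(\mathrm{tr}(\Sigma_l^2)/n_l)^{1/2}/p\} = 1 + o_P(1)$, $l = 1,2$, and $\mathrm{tr}\{(x_0 - \mu_i)(x_0 - \mu_i)^T - \Sigma_i\} = O_P(\mathrm{tr}(\Sigma_i^2)^{1/2})$ under (A-i). Also, note that $\mathrm{tr}(\Sigma_i^2)\mathrm{tr}(\Sigma_l^2) \le \lambda_{i1}\lambda_{l1}\mathrm{tr}(\Sigma_i)\mathrm{tr}(\Sigma_l) = o(n_{\min}\delta_{\min(II)}^2p^2)$, $l = 1,2$, under (C-iv'). Then, from (B.18), it holds that for $l = 1,2$

\[
\begin{aligned}
\mathrm{tr}[\{(x_0 - \mu_i)(x_0 - \mu_i)^T - \Sigma_i\}B_l] &= p\,\frac{\mathrm{tr}(\Sigma_l) - \mathrm{tr}(S_{ln_l})}{\mathrm{tr}(\Sigma_l)\mathrm{tr}(S_{ln_l})}\,\mathrm{tr}\{(x_0 - \mu_i)(x_0 - \mu_i)^T - \Sigma_i\} \\
&= O_P\{(\mathrm{tr}(\Sigma_i^2)\mathrm{tr}(\Sigma_l^2)/n_l)^{1/2}/p\} = o_P(\delta_{\min(II)}), \text{ and} \\
p\|B_l\|/n_l^{1/2} &= O_P\{\mathrm{tr}(\Sigma_l^2)^{1/2}/n_l\} = o_P(\delta_{\min(II)})
\end{aligned} \tag{B.19}
\]

under (A-i) and (C-iv'). Similarly, from (B.18), under (A-i) and (C-iv'), we have that for $i' \neq i$

\[
\begin{aligned}
\{2(x_0 - \mu_i) + (-1)^{i+1}\mu_{12}\}^TB_{i'}\mu_{12} &= O_P\{(\mu_{12}^T\Sigma_i\mu_{12}/n_{i'})^{1/2}\} + O_P\{(\mathrm{tr}(\Sigma_{i'}^2)/n_{i'})^{1/2}\|\mu_{12}\|^2/p\} \\
&= O_P\{(\lambda_{i1}\|\mu_{12}\|^2/n_{i'})^{1/2}\} + O_P\{(\lambda_{i'1}\|\mu_{12}\|^2/n_{i'})^{1/2}\} = o_P(\delta_{\min(II)})
\end{aligned}
\]

from the facts that $\mu_{12}^T\Sigma_i\mu_{12} \le \lambda_{i1}\|\mu_{12}\|^2$, $\mathrm{tr}(\Sigma_{i'}^2) = O(\lambda_{i'1}p)$ and $\|\mu_{12}\|^2 = O(p)$. On the other hand, under (A-i) and (C-iv'), from (B.18), we have that for $l = 1,2$

\[
\begin{aligned}
\log\{\mathrm{tr}(\Sigma_l)/\mathrm{tr}(S_{ln_l})\} &= (\mathrm{tr}(\Sigma_l)/\mathrm{tr}(S_{ln_l}) - 1) + O_P\{(\mathrm{tr}(\Sigma_l)/\mathrm{tr}(S_{ln_l}) - 1)^2\} \\
&= (\mathrm{tr}(\Sigma_l)/\mathrm{tr}(S_{ln_l}) - 1) + O_P\{\mathrm{tr}(\Sigma_l^2)/(n_lp^2)\} \\
&= (\mathrm{tr}(\Sigma_l)/\mathrm{tr}(S_{ln_l}) - 1) + o_P(\delta_{\min(II)}/p)
\end{aligned}
\]

from the facts that $\mathrm{tr}(\Sigma_l^2)/p = O(\lambda_{l1})$ and $\mathrm{tr}(\Sigma_l)/\mathrm{tr}(S_{ln_l}) = 1 + o_P(1)$. Then, under (A-i) and (C-iv'), it holds that

\[
\mathrm{tr}(\Sigma_iB_i) - \log|\hat{A}_iA_i^{-1}| = p(\mathrm{tr}(\Sigma_i)/\mathrm{tr}(S_{in_i}) - 1) - p\log\{\mathrm{tr}(\Sigma_i)/\mathrm{tr}(S_{in_i})\} = o_P(\delta_{\min(II)}).
\]

Similarly, under (A-i) and (C-iv'), we have that

\[
\begin{aligned}
\mathrm{tr}(\Sigma_iB_{i'}) - \log|\hat{A}_{i'}A_{i'}^{-1}| &= p(\mathrm{tr}(\Sigma_i)/\mathrm{tr}(\Sigma_{i'}) - 1)(\mathrm{tr}(\Sigma_{i'})/\mathrm{tr}(S_{i'n_{i'}}) - 1) + o_P(\delta_{\min(II)}) \\
&= O_P(|\mathrm{tr}(\Sigma_i)/\mathrm{tr}(\Sigma_{i'}) - 1|(\mathrm{tr}(\Sigma_{i'}^2)/n_{i'})^{1/2}) + o_P(\delta_{\min(II)}).
\end{aligned} \tag{B.20}
\]

By combining (B.19) to (B.20) with Lemma B.3 and Corollary 3.1, we can claim the result. □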

Proof of Corollary 4.3. We can write that

\[
s_{in_i(j)} = n_is_{oin_i(j)}/(n_i - 1) - n_i(\bar{x}_{ijn_i} - \mu_{ij})^2/(n_i - 1), \tag{B.21}
\]

where s oin . (i ) = Y%Li( x ijk ~ Hij) 2 /rH- Note that limsupp^^ E{exp(tij\(xi jk - Pij) 2 - 
suPp-j.oo[Li{exp (tij|Xijk - fMj \ 2 + exp < oo under 
(A-iii). Then, under (A-iii), for any $x$ satisfying $x \to \infty$ and $x = o(n_i^{1/2})$ as $n_i \to \infty$, we have that as $n_i \to \infty$

\[
P(n_i^{1/2}|s_{oin_i(j)} - \sigma_{i(j)}|/\eta_{i(j)}^{1/2} \ge x) = \exp\left(-\frac{x^2}{2}\{1 + o(1)\}\right).
\]


Refer to Chapter 6 in de la Peña, Lai and Shao (2009) for the details of this result. Let $\tau_{1j} = M(\eta_{i(j)}n_i^{-1}\log p)^{1/2}$ for $j = 1,...,p$, where $M > 2^{1/2}$. Then, under $n_i^{-1}\log p = o(1)$, it holds that as $p \to \infty$

\[
\sum_{j=1}^pP(|s_{oin_i(j)} - \sigma_{i(j)}| \ge \tau_{1j}) = \sum_{j=1}^pP(n_i^{1/2}|s_{oin_i(j)} - \sigma_{i(j)}|/\eta_{i(j)}^{1/2} \ge M(\log p)^{1/2}) = \sum_{j=1}^p\exp\left(-\frac{M^2\log p}{2}\{1 + o(1)\}\right) \to 0. \tag{B.22}
\]

Next, we consider the second term of (IB. 211) . Let u^j = tij{ a i(j)/ r li(j))^ 2 for j = 1, ...,p. 
Then, we have that for j = 1, ...,p 


1 /2 

E{exp(uij | x oij k |)} 

1/2 1/2 
= L;{exp(u ii |x oii fc|/o- i(: / j) )/(|x 0 ijfc| < 1)} + E{e-x.Y>{ui j \x 0i j k \/a i ^)I{\x 0 ijk\ > 1)} 

< expfuij/o-J^) + ^{exp^x 2 ^ Ja p2 )} < exp^/a^) + ^{exp^x^^/r?^ 2 )}, 

1 /2 

so that limsupp^QQ E{exp(uij\x 0 ijk\/^ i ^)} < oo under (A-iii). Thus, in a way similar to 
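(B.22), we have that

\[
\sum_{j=1}^pP(|\bar{x}_{ijn_i} - \mu_{ij}| \ge \tau_{2j}) = \sum_{j=1}^pP(n_i^{1/2}|\bar{x}_{ijn_i} - \mu_{ij}|/\sigma_{i(j)}^{1/2} \ge M(\log p)^{1/2}) \to 0 \tag{B.23}
\]

for $\tau_{2j} = M(\sigma_{i(j)}n_i^{-1}\log p)^{1/2}$, $j = 1,...,p$. By combining (B.22) and (B.23) with (B.21), under $n_i^{-1}\log p = o(1)$ and (A-iii), we have that

\[
\begin{aligned}
&\sum_{j=1}^pP\{|s_{in_i(j)} - n_i\sigma_{i(j)}/(n_i - 1)| \ge n_i(\tau_{1j} + \tau_{2j}^2)/(n_i - 1)\} \\
&\le \sum_{j=1}^pP(|s_{oin_i(j)} - \sigma_{i(j)}| + |\bar{x}_{ijn_i} - \mu_{ij}|^2 \ge \tau_{1j} + \tau_{2j}^2) \\
&\le \sum_{j=1}^pP(|s_{oin_i(j)} - \sigma_{i(j)}| \ge \tau_{1j}) + \sum_{j=1}^pP(|\bar{x}_{ijn_i} - \mu_{ij}|^2 \ge \tau_{2j}^2) \to 0.
\end{aligned}
\]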

Note that $n_i\sigma_{i(j)}/(n_i - 1) = \sigma_{i(j)} + o(n_i^{-1/2})$ and $\tau_{2j}^2 = o(\tau_{1j})$ under $n_i^{-1}\log p = o(1)$. Thus we have that $\max_{j=1,...,p}\{|s_{in_i(j)} - \sigma_{i(j)}|\} = O_P(\max_{j=1,...,p}\tau_{1j})$ under $n_i^{-1}\log p = o(1)$ and (A-iii), so that

\[
\max_{j=1,...,p}\{|s_{in_i(j)} - \sigma_{i(j)}|\} = O_P\{(n_i^{-1}\log p)^{1/2}\}. \tag{B.24}
\]

Then, for $i = 1,2$, it holds that under $n_i^{-1}\log p = o(1)$

\[
\|B_i\| = \|S_{i(d)}^{-1} - \Sigma_{i(d)}^{-1}\| = \max_{j=1,...,p}\{|s_{in_i(j)} - \sigma_{i(j)}|/(s_{in_i(j)}\sigma_{i(j)})\} = O_P\{(n_i^{-1}\log p)^{1/2}\} = o_P(1). \tag{B.25}
\]

Then, it follows that (C-i') holds under (4.4). From the fact that $\Delta_{\min(III)} = O(p)$, note that $n_{\min}^{-1}\log p = o(1)$ under (4.4). Then, by combining (B.25) with Proposition 4.1 and Corollary 2.1, we can claim the result of Corollary 4.3. □
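The rate in (B.24) says that the largest error among $p$ sample variances is of order $(n_i^{-1}\log p)^{1/2}$. The sketch below checks this numerically in a toy setting with independent standard normal coordinates, so that $\sigma_{i(j)} = 1$ for all $j$. The setting, the function names and the $(p, n)$ pairs are assumptions chosen for illustration, not the paper's simulation design.

```python
import math
import random


def max_variance_deviation(p, n, rng):
    """max_j |s_j - 1| where s_j is the unbiased sample variance of n N(0,1) draws."""
    worst = 0.0
    for _ in range(p):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(xs) / n
        s = sum((x - mean) ** 2 for x in xs) / (n - 1)
        worst = max(worst, abs(s - 1.0))
    return worst


def scaled_deviation(p, n, seed=0):
    """Deviation divided by the claimed rate (log p / n)^(1/2)."""
    rng = random.Random(seed)
    return max_variance_deviation(p, n, rng) / math.sqrt(math.log(p) / n)


if __name__ == "__main__":
    for p, n in ((50, 100), (200, 100), (200, 400)):
        print("p =", p, "n =", n, "scaled deviation:", round(scaled_deviation(p, n), 3))
```

If the rate is right, the scaled deviation should stay bounded (in this Gaussian toy case, near or below 2) as $p$ and $n$ vary, rather than growing with $p$.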

Proof of Corollary 4.4. First, note that $s_{n(j)} - \sigma_{(j)} = \sum_{i=1}^2(n_i - 1)(s_{in_i(j)} - \sigma_{i(j)})/(\sum_{i=1}^2n_i - 2)$. From (B.24), we can claim that $\max_{j=1,...,p}\{|s_{n(j)} - \sigma_{(j)}|\} = O_P\{(n_{\min}^{-1}\log p)^{1/2}\}$ under $n_{\min}^{-1}\log p = o(1)$ and (A-iii). Thus it follows that $\|S_{(d)}^{-1} - \Sigma_{(d)}^{-1}\| = O_P\{(n_{\min}^{-1}\log p)^{1/2}\}$. Note that $\Delta/\|\mu_{12}\|^2 \in (0,\infty)$ as $p \to \infty$. Then, by combining Theorem 2.1 with Propositions 2.1 and 4.2, we can claim the result of Corollary 4.4. □

Proofs of Corollary f.5. Let S mrii = YlkLi( x ik ~ Pi)( x ik ~ l l i) T / n i and denote its (r, s) 
element by s oin . (rs) for r,s = 1 ,...,p. Let u i{rs) = mm{t ir / for r ’ s = 
1 ,...,p. Then, we have that for r, s = 1 ,...,p 

1 /2 

^{^-P(^i(rs) \%oirk%oisk &i(rs) 

— E[exp{lLi( j, s){,Xoi r k/^ "h %oisk "h ®i{rs) )/^(rs) 

< exp(u i(rs) a i(rs) /r/^ 2 s) )F;[exp{t ir x 2 irfc /(2^ ( / r 2 ) )}exp{^ s a; 2 isfc /(2?7^)}] 

< exp(u i(rs) CJj (rs ) / Vi[rs )) [£{ ex P (tirxlirk/vllr )) }^{exp(tisxLfcAty,)) }] ^, 

1 /2 

so that limsup p _ ) . 00 F;{exp(uj( rs )|x 0 i r feX 0 i s fc - a ifjs ) \/Vi[ rs ))} < oo under (A-iii). Note 

that Sj ni ( rs ) = TliSoi n i(r s)/iPi 1) f^iip'irni Mir)(®isrij [Hs)/(jH l)j where Sj ni ( rs ) 
is the (r, s) element of S tni . Also, note that Pi^ rs ) £ (0, oo) as p —> oo under (A-iii) 
and liminfp^oo r/j( rs ) > 0 for all r,s, from the fact that p.^ rs ) < {(??i( r ) + 7 p r ) ) ( r li(s) + 

a i(s) )} 1//2 . In a way similar to (111.221) and (IB.231) . under n- 1 logp = o(l), (A-iii) and 
lim infp^oo r/j( rs ) > 0 for all r,s, we have that 

p 

Y p {\ s ini(rs) - n i 7 i(rs)/( n i ~ 1)| > rii(T 1{rs) + T 2{rs) )/(m - 1)} 
r,s= 1 

P 

— ^ ^ {P(.\^oirii(rs) &i(rs) I — Tl(rs)) “1“ Mirll^isni Mis| ^ ^~2(rs))} 

r,s= 1 

P 

— ^ ^ Mir | T l^isn^ Msr| ^ ^~2(rs))} T ^(1) ^ 0 

r,s=l 

for $\tau_{1(rs)} = M(\eta_{i(rs)}n_i^{-1}\log p)^{1/2}$ and $\tau_{2(rs)} = M^2\{(\sigma_{i(r)} + \sigma_{i(s)})n_i^{-1}\log p\}$, $r,s = 1,...,p$, where $M > 2$. Thus it holds that $\max_{r,s=1,...,p}\{|s_{in_i(rs)} - \sigma_{i(rs)}|\} = O_P(\max_{r,s=1,...,p}\tau_{1(rs)})$ because $\tau_{2(rs)} = o(\tau_{1(rs)})$, so that

\[
\max_{r,s=1,...,p}\{|s_{in_i(rs)} - \sigma_{i(rs)}|\} = O_P\{(n_i^{-1}\log p)^{1/2}\}. \tag{B.26}
\]
r,s=l,...,p 
Here, from the equations (A1) and (A2) in Bickel and Levina (2008a), we have that $\|M\| \le \max_{s=1,...,p}\sum_{t=1}^p|m_{st}|$ for any symmetric matrix $M$, where $m_{st}$ is the $(s,t)$ element of $M$. From (B.26), we have that

\[
\|S_{in_i} - \Sigma_i\| = O_P\{p(n_i^{-1}\log p)^{1/2}\} = o_P(1) \tag{B.27}
\]

under $n_i^{-1}p^2\log p = o(1)$, (A-iii) and $\liminf_{p\to\infty}\eta_{i(rs)} > 0$ for all $r,s$. Then, under $\lambda(\Sigma_i) \in (0,\infty)$ as $p \to \infty$, we can claim that $\lambda(S_{in_i}) \in (0,\infty)$ in probability. Thus it holds that $\|e_p^T\Sigma_i^{-1}\| \in (0,\infty)$ and $\|e_p^TS_{in_i}^{-1}\| \in (0,\infty)$ in probability, where $e_p$ is an arbitrary (random) $p$-vector such that $\|e_p\| = 1$. Then, from (B.27), we have that $e_p^T\Sigma_i^{-1}(S_{in_i} - \Sigma_i)S_{in_i}^{-1}e_p = e_p^T(\Sigma_i^{-1} - S_{in_i}^{-1})e_p = O_P\{p(n_i^{-1}\log p)^{1/2}\}$ under $n_i^{-1}p^2\log p = o(1)$, (A-iii) and $\liminf_{p\to\infty}\eta_{i(rs)} > 0$ for all $r,s$, so that $\|B_i\| = O_P\{p(n_i^{-1}\log p)^{1/2}\} = o_P(1)$. Note that (C-i') and (C-ii') hold under the conditions of Corollary 4.5. Also, note that $\mathrm{tr}\{(I_p - \Sigma_i\Sigma_{i'}^{-1})^2\} = O(p)$ ($i' \neq i$) under $\lambda(\Sigma_i) \in (0,\infty)$ as $p \to \infty$. By combining Corollary 3.2 with Proposition 4.1, we can claim the result of Corollary 4.5. □
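The step from the entrywise rate (B.26) to the operator-norm rate (B.27) uses the bound $\|M\| \le \max_s\sum_t|m_{st}|$ for symmetric $M$, which is where the extra factor $p$ comes from. The sketch below compares the two sides on random symmetric matrices, estimating the spectral norm by power iteration on $M^2$. All function names are illustrative assumptions, and the power-iteration value is a lower estimate of $\|M\|$, so this is a consistency check rather than a proof.

```python
import math
import random


def matvec(m, v):
    """Multiply the square matrix m (list of rows) by the vector v."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]


def max_abs_row_sum(m):
    """Right-hand side of the bound: max_s sum_t |m_st|."""
    return max(sum(abs(x) for x in row) for row in m)


def spectral_norm_symmetric(m, iters=200, seed=0):
    """Estimate ||M|| for symmetric M by power iteration on M^2."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in range(len(m))]
    for _ in range(iters):
        w = matvec(m, matvec(m, v))
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in w]
    mv = matvec(m, v)  # v is a unit vector here, so ||Mv|| <= ||M||
    return math.sqrt(sum(x * x for x in mv))


def random_symmetric(p, rng):
    """Random symmetric p x p matrix with standard normal entries."""
    m = [[0.0] * p for _ in range(p)]
    for r in range(p):
        for s in range(r, p):
            m[r][s] = m[s][r] = rng.gauss(0.0, 1.0)
    return m


if __name__ == "__main__":
    rng = random.Random(1)
    m = random_symmetric(6, rng)
    print("spectral norm estimate:", round(spectral_norm_symmetric(m), 4))
    print("max absolute row sum:  ", round(max_abs_row_sum(m), 4))
```

Applied to $M = S_{in_i} - \Sigma_i$, each row sum has $p$ terms of size $O_P\{(n_i^{-1}\log p)^{1/2}\}$, which gives the rate $p(n_i^{-1}\log p)^{1/2}$ in (B.27).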


Proof of Corollary 5.1. By using Theorem 5.1, we can claim the result straightforwardly. □


Proof of Corollary 5.2. Let us write that for $i = 1,2$

\[
W_i(\Sigma_{i(d)}^{-1})_{FS} = \sum_{j \in D}\{(x_{0j} - \bar{x}_{ijn_i})^2/\sigma_{i(j)} - s_{in_i(j)}/(\sigma_{i(j)}n_i) + \log\sigma_{i(j)}\}.
\]

Note that $E\{W_{i'}(\Sigma_{i'(d)}^{-1})_{FS}\} - E\{W_i(\Sigma_{i(d)}^{-1})_{FS}\} = \Delta_{i(III)}$ ($i' \neq i$) when $x_0 \in \pi_i$. Also note that $\liminf_{p\to\infty}\Delta_{\min(III)}/p_* > 0$ under $\liminf_{p\to\infty}\theta_j > 0$ for all $j \in D$. If $\lambda_{\max}(\Sigma_{i*}) = o(p_*)$, (C-i') and (C-ii') hold for $\Sigma_{i*}$, $i = 1,2$. Here, let us write that $\Sigma_{i(d)*} = \mathrm{diag}(\sigma_{i(j_1)},...,\sigma_{i(j_{p_*})})$ and $S_{i(d)*} = \mathrm{diag}(s_{in_i(j_1)},...,s_{in_i(j_{p_*})})$ for $i = 1,2$, where $D = \{j_1,...,j_{p_*}\}$. Then, in a way similar to (B.25), under $n_i^{-1}\log p = o(1)$ and (A-iii), it holds that $\|S_{i(d)*}^{-1} - \Sigma_{i(d)*}^{-1}\| = O_P\{(n_i^{-1}\log p)^{1/2}\}$. Hence, we have that $p_*\|S_{i(d)*}^{-1} - \Sigma_{i(d)*}^{-1}\| = o_P(\Delta_{\min(III)})$ under $\liminf_{p\to\infty}\theta_j > 0$ for all $j \in D$. By combining Corollary 5.1 with Propositions 2.1 and 4.1, we can claim the result. □